Preface



The origins of this book lie in our earlier book Random Processes: A Math- 
ematical Approach for Engineers, Prentice Hall, 1986. This book began as 
a second edition to the earlier book and the basic goal remains unchanged 
— to introduce the fundamental ideas and mechanics of random processes 
to engineers in a way that accurately reflects the underlying mathematics, 
but does not require an extensive mathematical background and does not 
belabor detailed general proofs when simple cases suffice to get the basic 
ideas across. In the thirteen years since the original book was published, 
however, numerous improvements in the presentation of the material have 
been suggested by colleagues, students, teaching assistants, and by our own 
teaching experience. The emphasis of the class shifted increasingly towards 
examples and a viewpoint that better reflected the course title: An Intro- 
duction to Statistical Signal Processing. Much of the basic content of this 
course and of the fundamentals of random processes can be viewed as the 
analysis of statistical signal processing systems: typically one is given a 
probabilistic description for one random object, which can be considered 
as an input signal. An operation or mapping or filtering is applied to the
input signal (signal processing) to produce a new random object, the out-
put signal. Fundamental issues include the nature of the basic probabilistic
description and the derivation of the probabilistic description of the output 
signal given that of the input signal and a description of the particular oper- 
ation performed. A perusal of the literature in statistical signal processing, 
communications, control, image and video processing, speech and audio 
processing, medical signal processing, geophysical signal processing, and 
classical statistical areas of time series analysis, classification and regres-
sion, and pattern recognition shows a wide variety of probabilistic models for
input processes and for operations on those processes, where the operations 
might be deterministic or random, natural or artificial, linear or nonlinear, 
digital or analog, or beneficial or harmful. An introductory course focuses 
on the fundamentals underlying the analysis of such systems: the theories 
of probability, random processes, systems, and signal processing. 

When the original book went out of print, the time seemed ripe to
convert the manuscript from the prehistoric troff to LaTeX and to undertake
a serious revision of the book in the process. As the revision became more 
extensive, the title changed to match the course name and content. We 
reprint the original preface to provide some of the original motivation for 
the book, and then close this preface with a description of the goals sought 
during the revisions. 



Preface to Random Processes: An Introduction for 
Engineers 

Nothing in nature is random ... A thing appears random 
only through the incompleteness of our knowledge. — Spinoza, 

Ethics I 

I do not believe that God rolls dice. — attributed to Einstein 

Laplace argued to the effect that given complete knowledge of the physics 
of an experiment, the outcome must always be predictable. This metaphys- 
ical argument must be tempered with several facts. The relevant param- 
eters may not be measurable with sufficient precision due to mechanical 
or theoretical limits. For example, the uncertainty principle prevents the 
simultaneous accurate knowledge of both position and momentum. The 
deterministic functions may be too complex to compute in finite time. The 
computer itself may make errors due to power failures, lightning, or the 
general perfidy of inanimate objects. The experiment could take place in 
a remote location with the parameters unknown to the observer; for ex- 
ample, in a communication link, the transmitted message is unknown a 
priori, for if it were not, there would be no need for communication. The 
results of the experiment could be reported by an unreliable witness — 
either incompetent or dishonest. For these and other reasons, it is useful 
to have a theory for the analysis and synthesis of processes that behave in 
a random or unpredictable manner. The goal is to construct mathematical 
models that lead to reasonably accurate prediction of the long-term average 
behavior of random processes. The theory should produce good estimates 
of the average behavior of real processes and thereby correct theoretical 
derivations with measurable results. 

In this book we attempt a development of the basic theory and ap- 
plications of random processes that uses the language and viewpoint of 
rigorous mathematical treatments of the subject but which requires only a 
typical bachelor’s degree level of electrical engineering education including 
elementary discrete and continuous time linear systems theory, elementary
probability, and transform theory and applications. Detailed proofs are 
presented only when within the scope of this background. These simple 
proofs, however, often provide the groundwork for “handwaving” justifi- 
cations of more general and complicated results that are semi-rigorous in 
that they can be made rigorous by the appropriate delta-epsilontics of real 
analysis or measure theory. A primary goal of this approach is thus to use 
intuitive arguments that accurately reflect the underlying mathematics and 
which will hold up under scrutiny if the student continues to more advanced 
courses. Another goal is to enable the student who might not continue to 
more advanced courses to be able to read and generally follow the modern 
literature on applications of random processes to information and commu- 
nication theory, estimation and detection, control, signal processing, and 
stochastic systems theory. 



Revision 

The most recent (summer 1999) revision fixed numerous typos reported 
during the previous year and added quite a bit of material on jointly Gaus- 
sian vectors in Chapters 3 and 4 and on minimum mean squared error 
estimation of vectors in Chapter 4. 

This revision is a work in progress. Revised versions will be made avail- 
able through the World Wide Web page 

http://www-isl.stanford.edu/~gray/sp.html
The material is copyrighted by the authors, but is freely available to any 
who wish to use it provided only that the contents of the entire text remain 
intact and together. A copyright release form is available for printing the 
book at the Web page. Comments, corrections, and suggestions should be 
sent to rmgray@stanford.edu. Every effort will be made to fix typos and 
take suggestions into account on at least an annual basis.

I hope to put together a revised solutions manual when time permits, 
but time has not permitted during the past year. 



Glossary



$\{\,\}$ a collection of points satisfying some property, e.g., $\{r : r \le a\}$ is the
collection of all real numbers less than or equal to a value $a$

$[\,]$ an interval of real points including the end points, e.g., for $a \le b$,
$[a,b] = \{r : a \le r \le b\}$. Called a closed interval.

$(\,)$ an interval of real points excluding the end points, e.g., for $a \le b$,
$(a,b) = \{r : a < r < b\}$. Called an open interval. Note this is empty if
$a = b$.

$(\,]$, $[\,)$ denote intervals of real points including one endpoint and exclud-
ing the other, e.g., for $a \le b$, $(a,b] = \{r : a < r \le b\}$, $[a,b) = \{r : a \le r < b\}$.

$\emptyset$ The empty set, the set that contains no points.

$\Omega$ The sample space or universal set, the set that contains all of the
points.

$\mathcal{F}$ Sigma-field or event space

P probability measure 

$P_X$ distribution of a random variable or vector $X$

$p_X$ probability mass function (pmf) of a random variable $X$

$f_X$ probability density function (pdf) of a random variable $X$

$F_X$ cumulative distribution function (cdf) of a random variable $X$



$E(X)$ expectation of a random variable $X$

$M_X(ju)$ characteristic function of a random variable $X$

$1_F(x)$ indicator function of a set $F$

$\Phi$ Phi function (Eq. (2.78))

Q Complementary Phi function (Eq. (2.79)) 




Chapter 1 

Introduction 



A random or stochastic process is a mathematical model for a phenomenon 
that evolves in time in an unpredictable manner from the viewpoint of the 
observer. The phenomenon may be a sequence of real-valued measurements 
of voltage or temperature, a binary data stream from a computer, a mod- 
ulated binary data stream from a modem, a sequence of coin tosses, the 
daily Dow-Jones average, radiometer data or photographs from deep space
probes, a sequence of images from a cable television, or any of an infinite 
number of possible sequences, waveforms, or signals of any imaginable type. 
It may be unpredictable due to such effects as interference or noise in a com- 
munication link or storage medium, or it may be an information-bearing 
signal, deterministic from the viewpoint of an observer at the transmitter
but random to an observer at the receiver. 

The theory of random processes quantifies the above notions so that 
one can construct mathematical models of real phenomena that are both 
tractable and meaningful in the sense of yielding useful predictions of fu- 
ture behavior. Tractability is required in order for the engineer (or anyone 
else) to be able to perform analyses and syntheses of random processes, 
perhaps with the aid of computers. The “meaningful” requirement is that 
the models provide a reasonably good approximation of the actual phe- 
nomena. An oversimplified model may provide results and conclusions that 
do not apply to the real phenomenon being modeled. An overcomplicated 
one may constrain potential applications, render theory too difficult to be 
useful, and strain available computational resources. Perhaps the most dis- 
tinguishing characteristic between an average engineer and an outstanding 
engineer is the ability to derive effective models providing a good balance 
between complexity and accuracy. 

Random processes usually occur in applications in the context of envi-
ronments or systems which change the processes to produce other processes.
The intentional operation on a signal produced by one process, an “input 
signal,” to produce a new signal, an “output signal,” is generally referred 
to as signal processing, a topic easily illustrated by examples. 

• A time varying voltage waveform is produced by a human speaking
into a microphone or telephone. This signal can be modeled by a
random process. The signal might be modulated for transmission, or
it might be digitized and coded for transmission on a digital link.
Noise in the digital link can cause errors in the reconstructed bits,
which are then used to reconstruct the original signal within some
fidelity. All of these operations on signals can be considered as signal
processing, although the name is most commonly used for the man-
made operations such as modulation, digitization, and coding, rather
than the natural, possibly unavoidable changes such as the addition
of thermal noise or other changes out of our control.

• For very low bit rate digital speech communication applications, the 
speech is sometimes converted into a model consisting of a simple 
linear filter (called an autoregressive filter) and an input process. The 
idea is that the parameters describing the model can be communicated 
with fewer bits than can the original signal, but the receiver can 
synthesize the human voice at the other end using the model so that 
it sounds very much like the original signal. 

• Signals including image data transmitted from remote spacecraft are 
virtually buried in noise added to them en route and in the front
end amplifiers of the powerful receivers used to retrieve the signals. 
By suitably preparing the signals prior to transmission, by suitable 
filtering of the received signal plus noise, and by suitable decision or 
estimation rules, high quality images have been transmitted through 
this very poor channel. 

• Signals produced by biomedical measuring devices can display spe- 
cific behavior when a patient suddenly changes for the worse. Signal 
processing systems can look for these changes and warn medical per- 
sonnel when suspicious behavior occurs. 

How are these signals characterized? If the signals are random, how 
does one find stable behavior or structure to describe the processes? How 
do operations on these signals change them? How can one use observations 
based on random signals to make intelligent decisions regarding future be- 
havior? All of these questions lead to aspects of the theory and application 
of random processes. 

Courses and texts on random processes usually fall into either of two
general and distinct categories. One category is the common engineering 
approach, which involves fairly elementary probability theory, standard un- 
dergraduate Riemann calculus, and a large dose of “cookbook” formulas — 
often with insufficient attention paid to conditions under which the formu- 
las are valid. The results are often justified by nonrigorous and occasionally 
mathematically inaccurate handwaving or intuitive plausibility arguments 
that may not reflect the actual underlying mathematical structure and may 
not be supportable by a precise proof. While intuitive arguments can be 
extremely valuable in providing insight into deep theoretical results, they 
can be a handicap if they do not capture the essence of a rigorous proof. 

A development of random processes that is insufficiently mathematical 
leaves the student ill prepared to generalize the techniques and results when 
faced with a real-world example not covered in the text. For example, if 
one is faced with the problem of designing signal processing equipment for 
predicting or communicating measurements being made for the first time 
by a space probe, how does one construct a mathematical model for the 
physical process that will be useful for analysis? If one encounters a process 
that is neither stationary nor ergodic, what techniques still apply? Can the 
law of large numbers still be used to construct a useful model? 

An additional problem with an insufficiently mathematical development 
is that it does not leave the student adequately prepared to read modern 
literature such as the many Transactions of the IEEE. The more advanced 
mathematical language of recent work is increasingly used even in simple 
cases because it is precise and universal and focuses on the structure com- 
mon to all random processes. Even if an engineer is not directly involved 
in research, knowledge of the current literature can often provide useful 
ideas and techniques for tackling specific problems. Engineers unfamiliar 
with basic concepts such as sigma-field and conditional expectation will find
many potentially valuable references shrouded in mystery.

The other category of courses and texts on random processes is the 
typical mathematical approach, which requires an advanced mathemati- 
cal background of real analysis, measure theory, and integration theory; 
it involves precise and careful theorem statements and proofs, and it is 
far more careful to specify precisely the conditions required for a result 
to hold. Most engineers do not, however, have the required mathematical 
background, and the extra care required in a completely rigorous develop- 
ment severely limits the number of topics that can be covered in a typical 
course — in particular, the applications that are so important to engineers 
tend to be neglected. In addition, too much time can be spent with the 
formal details, obscuring the often simple and elegant ideas behind a proof. 
Often little, if any, physical motivation for the topics is given. 

This book attempts a compromise between the two approaches by giving
the basic, elementary theory and a profusion of examples in the language 
and notation of the more advanced mathematical approaches. The intent 
is to make the crucial concepts clear in the traditional elementary cases, 
such as coin flipping, and thereby to emphasize the mathematical structure 
of all random processes in the simplest possible context. The structure is 
then further developed by numerous increasingly complex examples of ran- 
dom processes that have proved useful in stochastic systems analysis. The 
complicated examples are constructed from the simple examples by signal 
processing, that is, by using a simple process as an input to a system whose 
output is the more complicated process. This has the double advantage 
of describing the action of the system, the actual signal processing, and 
the interesting random process which is thereby produced. As one might 
suspect, signal processing can be used to produce simple processes from 
complicated ones. 

Careful proofs are constructed only in elementary cases. For example, 
the fundamental theorem of expectation is proved only for discrete random 
variables, where it is proved simply by a change of variables in a sum. 
The continuous analog is subsequently given without a careful proof, but 
with the explanation that it is simply the integral analog of the summation 
formula and hence can be viewed as a limiting form of the discrete result. 
As another example, only weak laws of large numbers are proved in detail 
in the mainstream of the text, but the stronger laws are at least stated and 
they are discussed in some detail in starred sections. 
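
As an illustration of the discrete case just described, the change of
variables in the sum can be written out as follows. The notation here is
assumed for the sketch rather than taken from the text: $X$ is a discrete
random variable with pmf $p_X$, $g$ is a function, and $Y = g(X)$ has pmf
$p_Y$.

```latex
\begin{align*}
E(Y) &= \sum_y y\, p_Y(y)
      = \sum_y y \sum_{x : g(x) = y} p_X(x) \\
     &= \sum_y \sum_{x : g(x) = y} g(x)\, p_X(x)
      = \sum_x g(x)\, p_X(x).
\end{align*}
```

Replacing the pmf by a pdf $f_X$ and the sum by an integral suggests the
continuous analog, $E(g(X)) = \int g(x) f_X(x)\,dx$, which is stated here
without proof.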

By these means we strive to capture the spirit of important proofs with- 
out undue tedium and to make plausible the required assumptions and con- 
straints. This, in turn, should aid the student in determining when certain 
tools do or do not apply and what additional tools might be necessary when 
new generalizations are required. 

A distinct aspect of the mathematical viewpoint is the “grand exper- 
iment” view of random processes as being a probability measure on se- 
quences (for discrete time) or waveforms (for continuous time) rather than 
being an infinity of smaller experiments representing individual outcomes 
(called random variables) that are somehow glued together. From this point 
of view random variables are merely special cases of random processes. In 
fact, the grand experiment viewpoint was popular in the early days of ap- 
plications of random processes to systems and was called the “ensemble” 
viewpoint in the work of Norbert Wiener and his students. By viewing the 
random process as a whole instead of as a collection of pieces, many basic 
ideas, such as stationarity and ergodicity, that characterize the dependence 
on time of probabilistic descriptions and the relation between time averages 
and probabilistic averages are much easier to define and study. This also 
permits a more complete discussion of processes that violate such proba-
bilistic regularity requirements yet still have useful relations between time
and probabilistic averages. 

Even though a student completing this book will not be able to fol- 
low the details in the literature of many proofs of results involving random 
processes, the basic results and their development and implications should 
be accessible, and the most common examples of random processes and 
classes of random processes should be familiar. In particular, the student 
should be well equipped to follow the gist of most arguments in the vari- 
ous Transactions of the IEEE dealing with random processes, including the 
IEEE Transactions on Signal Processing, IEEE Transactions on Image Pro- 
cessing, IEEE Transactions on Speech and Audio Processing, IEEE Trans- 
actions on Communications, IEEE Transactions on Control, and IEEE 
Transactions on Information Theory. 

It also should be mentioned that the authors are electrical engineers 
and, as such, have written this text with an electrical engineering flavor. 
However, the required knowledge of classical electrical engineering is slight, 
and engineers in other fields should be able to follow the material presented. 

This book is intended to provide a one-quarter or one-semester course 
that develops the basic ideas and language of the theory of random pro- 
cesses and provides a rich collection of examples of commonly encountered 
processes, properties, and calculations. Although in some cases these ex- 
amples may seem somewhat artificial, they are chosen to illustrate the way 
engineers should think about random processes and for simplicity and con- 
ceptual content rather than to present the method of solution to some 
particular application. Sections that can be skimmed or omitted for the
shorter one-quarter curriculum are marked with a star (⋆). Discrete time
processes are given more emphasis than in many texts because they are 
simpler to handle and because they are of increasing practical importance 
in digital systems. For example, linear filter input/output relations are
carefully developed for discrete time and then the continuous time analogs 
are obtained by replacing sums with integrals. 

Most examples are developed by beginning with simple processes and 
then filtering or modulating them to obtain more complicated processes. 
This provides many examples of typical probabilistic computations and 
output of operations on simple processes. Extra tools are introduced as 
needed to develop properties of the examples. 
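
A minimal sketch of this idea, under assumptions that are ours and not the
text's: the simple input process is a sequence of independent fair coin flips,
and the two hypothetical systems are a length-k moving average and a
first-order autoregressive filter with coefficient r. The function names are
illustrative only.

```python
import random


def coin_flips(n, p=0.5, seed=0):
    """Return n independent binary samples, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


def moving_average(x, k):
    """Output y[n] = (1/k) * (x[n] + ... + x[n-k+1]), with inputs before time 0 taken as 0."""
    y = []
    for n in range(len(x)):
        window = x[max(0, n - k + 1): n + 1]
        y.append(sum(window) / k)
    return y


def autoregressive(x, r):
    """Output y[n] = r * y[n-1] + x[n], starting from y[-1] = 0."""
    y, prev = [], 0.0
    for sample in x:
        prev = r * prev + sample
        y.append(prev)
    return y


x = coin_flips(1000)                  # simple input process
y_ma = moving_average(x, k=4)         # more complicated output process
y_ar = autoregressive(x, r=0.5)       # another output process from the same input
```

Both outputs depend on several past inputs, so their samples are no longer
independent even though the input samples are; this is the sense in which a
simple system turns a simple process into a more complicated one.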

The prerequisites for this book are elementary set theory, elementary 
probability, and some familiarity with linear systems theory (Fourier anal- 
ysis, convolution, discrete and continuous time linear filters, and transfer 
functions). The elementary set theory and probability may be found, for ex-
ample, in the classic text by Al Drake [12]. The Fourier and linear systems
material can be found, for example, in Gray and Goodman [23]. Although
some of these basic topics are reviewed in this book in appendix A, they are 
considered prerequisite as the pace and density of material would likely be 
overwhelming to someone not already familiar with the fundamental ideas 
of probability such as probability mass and density functions (including the 
more common named distributions), computing probabilities, derived dis- 
tributions, random variables, and expectation. It has long been the authors’ 
experience that the students having the most difficulty with this material 
are those with little or no experience with elementary probability. 



Organization of the Book 

Chapter 2 provides a careful development of the fundamental concept of
probability theory — a probability space or experiment. The notions of 
sample space, event space, and probability measure are introduced, and 
several examples are toured. Independence and elementary conditional 
probability are developed in some detail. The ideas of signal processing 
and of random variables are introduced briefly as functions or operations 
on the output of an experiment. This in turn allows mention of the idea 
of expectation at an early stage as a generalization of the description of 
probabilities by sums or integrals. 

Chapter 3 treats the theory of measurements made on experiments:
random variables, which are scalar-valued measurements; random vectors, 
which are a vector or finite collection of measurements; and random pro- 
cesses, which can be viewed as sequences or waveforms of measurements. 
Random variables, vectors, and processes can all be viewed as forms of sig- 
nal processing: each operates on “inputs,” which are the sample points of 
a probability space, and produces an “output,” which is the resulting sam- 
ple value of the random variable, vector, or process. These output points 
together constitute an output sample space, which inherits its own proba- 
bility measure from the structure of the measurement and the underlying 
experiment. As a result, many of the basic properties of random variables, 
vectors, and processes follow from those of probability spaces. Probability 
distributions are introduced along with probability mass functions, proba- 
bility density functions, and cumulative distribution functions. The basic 
derived distribution method is described and demonstrated by example. A 
wide variety of examples of random variables, vectors, and processes are 
treated. 

Chapter 4 develops in depth the ideas of expectation, averages of ran-
dom objects with respect to probability distributions. Also called proba-
bilistic averages, statistical averages, and ensemble averages, expectations
can be thought of as providing simple but important parameters describ-
ing probability distributions. A variety of specific averages are considered,
including mean, variance, characteristic functions, correlation, and covari- 
ance. Several examples of unconditional and conditional expectations and 
their properties and applications are provided. Perhaps the most impor- 
tant application is to the statement and proof of laws of large numbers or 
ergodic theorems, which relate long term sample average behavior of ran- 
dom processes to expectations. In this chapter laws of large numbers are 
proved for simple, but important, classes of random processes. Other im- 
portant applications of expectation arise in performing and analyzing signal 
processing applications such as detecting, classifying, and estimating data. 
Minimum mean squared nonlinear and linear estimation of scalars and vec- 
tors is treated in some detail, showing the fundamental connections among 
conditional expectation, optimal estimation, and second order moments of 
random variables and vectors. 

Chapter 5 concentrates on the computation of second-order moments — 
the mean and covariance — of a variety of random processes. The primary 
example is a form of derived distribution problem: if a given random process 
with known second-order moments is put into a linear system what are the 
second-order moments of the resulting output random process? This prob- 
lem is treated for linear systems represented by convolutions and for linear 
modulation systems. Transform techniques are shown to provide a simpli- 
fication in the computations, much like their ordinary role in elementary 
linear systems theory. The chapter closes with a development of several 
results from the theory of linear least-squares estimation. This provides 
an example of both the computation and the application of second-order 
moments. 

Chapter 6 develops a variety of useful models of sometimes complicated 
random processes. A powerful approach to modeling complicated random 
processes is to consider linear systems driven by simple random processes. 
Chapter 5 used this approach to compute second order moments; this chap-
ter goes beyond moments to develop a complete description of the output
processes. To accomplish this, however, one must make additional assump- 
tions on the input process and on the form of the linear filters. The general 
model of a linear filter driven by a memoryless process is used to develop 
several popular models of discrete time random processes. Analogous con- 
tinuous time random process models are then developed by direct descrip- 
tion of their behavior. The basic class of random processes considered is 
the class of independent increment processes, but other processes with sim- 
ilar definitions but quite different properties are also introduced. Among 
the models considered are autoregressive processes, moving-average pro-
cesses, ARMA (autoregressive-moving average) processes, random walks,
independent increment processes, Markov processes, Poisson and Gaussian
processes, and the random telegraph wave. We also briefly consider an ex- 
ample of a nonlinear system where the output random processes can at least 
be partially described — the exponential function of a Gaussian or Poisson 
process which models phase or frequency modulation. We close with ex- 
amples of a type of “doubly stochastic” process, compound processes made 
up by adding a random number of other random effects. 

Appendix A sketches several prerequisite definitions and concepts from 
elementary set theory and linear systems theory using examples to be en- 
countered later in the book. The first subject is crucial at an early stage 
and should be reviewed before proceeding to chapter 2. The second subject 
is not required until chapter 5, but it serves as a reminder of material with 
which the student should already be familiar. Elementary probability is not 
reviewed, as our basic development includes elementary probability. The 
review of prerequisite material in the appendix serves to collect together 
some notation and many definitions that will be used throughout the book. 
It is, however, only a brief review and cannot serve as a substitute for 
a complete course on the material. This chapter can be given as a first 
reading assignment and either skipped or skimmed briefly in class; lectures 
can proceed from an introduction, perhaps incorporating some preliminary 
material, directly to chapter 2. 

Appendix B provides some scattered definitions and results needed in 
the book that detract from the main development, but may be of interest 
for background or detail. These fall primarily in the realm of calculus and 
range from the evaluation of common sums and integrals to a consideration 
of different definitions of integration. Many of the sums and integrals should 
be prerequisite material, but it has been the authors’ experience that many 
students have either forgotten or not seen many of the standard tricks 
and hence several of the most important techniques for probability and 
signal processing applications are included. Also in this appendix some 
background information on limits of double sums and the Lebesgue integral 
is provided. 

Appendix C collects the common univariate pmf’s and pdf’s along with
their second order moments for reference. 

The book concludes with an appendix suggesting supplementary read- 
ing, providing occasional historical notes, and delving deeper into some of 
the technical issues raised in the book. We assemble in that section refer- 
ences on additional background material as well as on books that pursue 
the various topics in more depth or on a more advanced level. We feel that 
these comments and references are supplementary to the development and 
that less clutter results by putting them in a single appendix rather than 
strewing them throughout the text. The section is intended as a guide for 
further study, not as an exhaustive description of the relevant literature,
the latter goal being beyond the authors’ interests and stamina. 

Each chapter is accompanied by a collection of problems, many of which 
have been contributed by colleagues, readers, students, and former students.
It is important when doing the problems to justify any “yes/no” answers. 
If an answer is “yes,” prove it is so. If the answer is “no,” provide a 
counterexample.








Chapter 2 



Probability 



2.1 Introduction 

The theory of random processes is a branch of probability theory and prob- 
ability theory is a special case of the branch of mathematics known as 
measure theory. Probability theory and measure theory both concentrate 
on functions that assign real numbers to certain sets in an abstract space 
according to certain rules. These set functions can be viewed as measures 
of the size or weight of the sets. For example, the precise notion of area 
in two-dimensional Euclidean space and volume in three-dimensional space 
are both examples of measures on sets. Other measures on sets in three 
dimensions are mass and weight. Observe that from elementary calculus 
we can find volume by integrating a constant over the set. From physics 
we can find mass by integrating a mass density or summing point masses 
over a set. In both cases the set is a region of three-dimensional space. In 
a similar manner, probabilities will be computed by integrals of densities 
of probability or sums of “point masses” of probability. 

Both probability theory and measure theory consider only nonnegative 
real-valued set functions. The value assigned by the function to a set is 
called the probability or the measure of the set, respectively. The basic 
difference between probability theory and measure theory is that the former 
considers only set functions that are normalized in the sense of assigning 
the value of 1 to the entire abstract space, corresponding to the intuition 
that the abstract space contains every possible outcome of an experiment 
and hence should happen with certainty or probability 1. Subsets of the
space may carry some uncertainty and hence have probability at most 1.

Probability theory begins with the concept of a probability space, which 
is a collection of three items: 
1. An abstract space $\Omega$, such as encountered in appendix A, called a
sample space, which contains all distinguishable elementary outcomes 
or results of an experiment. These points might be names, numbers, 
or complicated signals. 

2. An event space or sigma-field $\mathcal{F}$ consisting of a collection of
subsets of the abstract space which we wish to consider as possible events
and to which we wish to assign a probability. We require that the event
space have an algebraic structure in the following sense: any finite
or countably infinite sequence of set-theoretic operations (union,
intersection, complementation, difference, symmetric difference) on
events must produce other events.

3. A probability measure P — an assignment of a number between 0 and 
1 to every event, that is, to every set in the event space. A probability 
measure must obey certain rules or axioms and will be computed by 
integrating or summing, analogous to area, volume, and mass. 

This chapter is devoted to developing the ideas underlying the triple 
$(\Omega, \mathcal{F}, P)$, which is collectively called a probability space or an experiment.
Before making these ideas precise, however, several comments are in order. 

First of all, it should be emphasized that a probability space is composed 
of three parts; an abstract space is only one part. Do not let the terminology 
confuse you: “space” has more than one usage. Having an abstract space 
model all possible distinguishable outcomes of an experiment should be 
an intuitive idea since it is simply giving a precise mathematical name 
to an imprecise English description. Since subsets of the abstract space 
correspond to collections of elementary outcomes, it should also be possible 
to assign probabilities to such sets. It is a little harder to see, but we can 
also argue that we should focus on the sets and not on the individual points 
when assigning probabilities since in many cases a probability assignment 
known only for points will not be very useful. For example, if we spin a fair 
pointer and the outcome is known to be equally likely to be any number 
between 0 and 1, then the probability that any particular point such as
.3781984637 or exactly $1/\pi$ occurs is 0 because there are an uncountable
infinity of possible points, none more likely than the others.¹ Hence knowing
only that the probability of each and every point is zero, we would be hard
pressed to make any meaningful inferences about the probabilities of other
events such as the outcome being between 1/2 and 3/4. Writers of fiction
(including Patrick O’Brian in his Aubrey-Maturin series) have often made
much of the fact that extremely unlikely events often occur. One can say
that zero probability events occur virtually all the time since the a priori
probability that the universe will be in exactly a particular configuration at
12:01AM Coordinated Universal Time (aka Greenwich Mean Time) is 0,
yet the universe will indeed be in some configuration at that time.

¹A set is countably infinite if it can be put into one-to-one correspondence
with the nonnegative integers and hence can be counted. For example, the set of
positive integers is countable and the set of all rational numbers is countable. The
set of all irrational numbers and the set of all real numbers are both uncountable.
See appendix A for a discussion of countably infinite vs. uncountably infinite
spaces.

The difficulty inherent in this example leads to a less natural aspect of 
the probability space triumvirate — the fact that we must specify an event 
space or collection of subsets of our abstract space to which we wish to 
assign probabilities. In the example it is clear that taking the individual 
points and their countable combinations is not enough (see also problem 
2.2). On the other hand, why not just make the event space the class of 
all subsets of the abstract space? Why require the specification of which 
subsets are to be deemed sufficiently important to be blessed with the name 
“event”? In fact, this concern is one of the principal differences between 
elementary probability theory and advanced probability theory (and the 
point at which the student’s intuition frequently runs into trouble). When 
the abstract space is finite or even countably infinite, one can consider all 
possible subsets of the space to be events, and one can build a useful theory. 
When the abstract space is uncountably infinite, however, as in the case of 
the space consisting of the real line or the unit interval, one cannot build 
a useful theory without constraining the subsets to which one will assign 
a probability. Roughly speaking, this is because probabilities of sets in 
uncountable spaces are found by integrating over sets, and some sets are 
simply too nasty to be integrated over. Although it is difficult to show, 
for such spaces there does not exist a reasonable and consistent means 
of assigning probabilities to all subsets without contradiction or without 
violating desirable properties. In fact, it is so difficult to show that such
“non-probability-measurable” subsets of the real line exist that we will not 
attempt to do so in this book. The reader should at least be aware of the 
problem so that the need for specifying an event space is understood. It 
also explains why the reader is likely to encounter phrases like “measurable 
sets” and “measurable functions” in the literature. 

Thus a probability space must make explicit not just the elementary 
outcomes or “finest-grain” outcomes that constitute our abstract space; it 
must also specify the collections of sets of these points to which we intend 
to assign probabilities. Subsets of the abstract space that do not belong to 
the event space will simply not have probabilities defined. The algebraic 
structure that we have postulated for the event space will ensure that if 
we take (countable) unions of events (corresponding to a logical “or”) or 
intersections of events (corresponding to a logical “and”), then the resulting
sets are also events and hence will have probabilities. In fact, this is one of 
the main functions of probability theory: given a probabilistic description 
of a collection of events, find the probability of some new event formed by 
set-theoretic operations on the given events. 



Up to this point the notion of signal processing has not been mentioned. 
It enters at a fundamental level if one realizes that each individual point
$\omega \in \Omega$ produced in an experiment can be viewed as a signal: it might be a
single voltage conveying the value of a measurement, a vector of values, a
sequence of values, or a waveform, any one of which can be interpreted as a
signal measured in the environment or received from a remote transmitter
or extracted from a physical medium that was previously recorded. Signal
processing in general is the performing of some operation on the signal. In
its simplest yet most general form this consists of applying some function or
mapping or operation $g$ to the signal or input $\omega$ to produce an output $g(\omega)$,
which might be intended to guess some hidden parameter, extract useful 
information from noise, enhance an image, or any simple or complicated 
operation intended to produce a useful outcome. If we have a probabilistic 
description of the underlying experiment, then we should be able to derive 
a probabilistic description of the outcome of the signal processor. This, in 
fact, is the core problem of derived distributions, one of the fundamental 
tools of both probability theory and signal processing. In fact, this idea of 
defining functions on probability spaces is the foundation for the definition 
of random variables, random vectors, and random processes, which will in- 
herit their basic properties from the underlying probability space, thereby 
yielding new probability spaces. Much of the theory of random processes 
and signal processing consists of developing the implications of certain oper- 
ations on probability spaces: beginning with some probability space we form 
new ones by operations called variously mappings, filtering, sampling, cod- 
ing, communicating, estimating, detecting, averaging, measuring, enhanc- 
ing, predicting, smoothing, interpolating, classifying, analyzing or other 
names denoting linear or nonlinear operations. Stochastic systems theory 
is the combination of systems theory with probability theory. The essence 
of stochastic systems theory is the connection of a system to a probability 
space. Thus a precise formulation and a good understanding of probability 
spaces are prerequisites to a precise formulation and correct development 
of examples of random processes and stochastic systems. 



Before proceeding to a careful development, several of the basic ideas 
are illustrated informally with simple examples. 
2.2 Spinning Pointers and Flipping Coins

Many of the basic ideas at the core of this text can be introduced and illus- 
trated by two very simple examples, the continuous experiment of spinning 
a pointer inside a circle and the discrete experiment of flipping a coin. 



A Uniform Spinning Pointer 

Figure 2.1: The Spinning Pointer

Suppose that Nature (or perhaps Tyche, the Greek Goddess of chance) spins
a pointer in a circle as depicted in Figure 2.1. When the pointer stops it can
point to any number in the unit interval $[0, 1) = \{r : 0 \le r < 1\}$. We call
$[0, 1)$ the sample space of our experiment and denote it by a capital Greek
omega, $\Omega$. What can we say about the probabilities or chances of particular
events or outcomes occurring as a result of this experiment? The sorts of 
events of interest are things like “the pointer points to a number between 0 
and .5” (which one would expect should have probability 0.5 if the wheel is 
indeed fair) or “the pointer does not lie between 0.75 and 1” (which should 
have a probability of 0.75). Two assumptions are implicit here. The first 
is that an “outcome” of the experiment or an “event” to which we can 
assign a probability is simply a subset of [0,1). The second assumption 
is that the probability of the pointer landing in any particular interval of 
the sample space is proportional to the length of the interval. This should 
seem reasonable if we indeed believe the spinning pointer to be “fair” in the 
sense of not favoring any outcomes over any others. The bigger a region of 
the circle, the more likely the pointer is to end up in that region. We can 
formalize this by stating that for any interval $[a, b] = \{r : a \le r \le b\}$ with
$0 \le a \le b < 1$ we have that the probability of the event “the pointer lands
in the interval $[a, b]$” is

$$P([a, b]) = b - a. \tag{2.1}$$

We do not have to restrict interest to intervals in order to define probabil- 
ities consistent with (2.1). The notion of the length of an interval can be 
made precise using calculus and simultaneously extended to any subset of 
$[0, 1)$ by defining the probability $P(F)$ of a set $F \subset [0, 1)$ as

$$P(F) = \int_F f(r)\,dr = \int 1_F(r) f(r)\,dr, \tag{2.2}$$

where $f(r) = 1$ for all $r \in [0, 1)$. With this definition it is clear that for any
$0 \le a \le b < 1$

$$P([a, b]) = \int_a^b f(r)\,dr = b - a. \tag{2.3}$$
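
As a small illustration of (2.1) and (2.3), the following Python sketch computes the probability of a set that is a finite union of disjoint intervals under the uniform pdf by adding up interval lengths. The function name `uniform_prob` and the representation of a set as a list of (a, b) pairs are assumptions made for this example; they are not notation from the text.

```python
from fractions import Fraction

def uniform_prob(intervals):
    """P(F) for F a union of disjoint intervals [a, b] inside [0, 1),
    under the uniform pdf f(r) = 1, so each piece contributes b - a.
    Disjointness of the pieces is assumed, not checked."""
    total = Fraction(0)
    for a, b in intervals:
        a, b = Fraction(a), Fraction(b)
        if not (0 <= a <= b < 1):
            raise ValueError("each interval must satisfy 0 <= a <= b < 1")
        total += b - a  # integral of f(r) = 1 over [a, b]
    return total

# "the pointer lands between 0 and 0.5"
print(uniform_prob([("0", "1/2")]))                    # 1/2
# a single point has zero length and hence zero probability
print(uniform_prob([("1/3", "1/3")]))                  # 0
# a union of two disjoint intervals
print(uniform_prob([("0", "1/4"), ("1/2", "3/4")]))    # 1/2
```

Exact fractions are used only so that the results can be compared with b - a without rounding error.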

We could also arrive at effectively the same model by considering the sample
space to be the entire real line, $\Omega = \Re = (-\infty, \infty)$, and defining the pdf to
be

$$f(r) = \begin{cases} 1 & \text{if } r \in [0, 1) \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}$$



The integral can also be expressed without specifying limits of integration
by using the indicator function of a set

$$1_F(r) = \begin{cases} 1 & \text{if } r \in F \\ 0 & \text{if } r \notin F \end{cases} \tag{2.5}$$

as

$$P(F) = \int 1_F(r) f(r)\,dr. \tag{2.6}$$
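
The indicator form of the integral can be tried numerically. The sketch below approximates the integral of the indicator times the pdf by a midpoint Riemann sum over a range that covers the support of the pdf of (2.4). The helper names (`indicator`, `uniform_pdf`, `prob`), the integration range, and the number of subintervals are all choices made for this illustration, and the result is only an approximation.

```python
def indicator(a, b):
    """Return the indicator function 1_F of the interval F = [a, b]."""
    return lambda r: 1.0 if a <= r <= b else 0.0

def uniform_pdf(r):
    """The pdf of (2.4): 1 on [0, 1) and 0 elsewhere on the real line."""
    return 1.0 if 0.0 <= r < 1.0 else 0.0

def prob(ind, pdf, lo=-1.0, hi=2.0, n=30_000):
    """Midpoint Riemann sum approximating the integral of 1_F(r) f(r) dr.
    The range [lo, hi] only needs to cover the support of the pdf."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        r = lo + (k + 0.5) * h      # midpoint of the k-th subinterval
        total += ind(r) * pdf(r)
    return total * h

# F = [0.25, 0.75] lies inside [0, 1): probability close to 0.5
print(abs(prob(indicator(0.25, 0.75), uniform_pdf) - 0.5) < 1e-3)
# F = [-0.5, 0.25] sticks out of [0, 1); only [0, 0.25] carries probability
print(abs(prob(indicator(-0.5, 0.25), uniform_pdf) - 0.25) < 1e-3)
```

No limits of integration that depend on F appear anywhere: the set enters only through its indicator, which is the point of (2.6).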



Other implicit assumptions have been made here. The first is that
probabilities must satisfy some consistency properties: we cannot arbitrarily
define probabilities of distinct subsets of $[0, 1)$ (or, more generally, $\Re$)
without regard to the implications of probabilities for other sets; the
probabilities must be consistent with each other in the sense that they do not
contradict each other. For example, if we have two formulas for computing
probabilities of a common event, as we have with (2.1) and (2.2) for
computing the probability of an interval, then both formulas must give the
same numerical result, as they do in this example.

The second implicit assumption is that the integral exists in a well de- 
fined sense, that it can be evaluated using calculus. As surprising as it 
may seem to readers familiar only with typical engineering-oriented devel- 
opments of Riemann integration, the integral of (2.2) is in fact not well 
defined for all subsets of [0, 1). But we leave this detail for later and as- 
sume for the moment that we only encounter sets for which the integral 
(and hence the probability) is well defined. 

The function $f(r)$ is called a probability density function or pdf since it is
a nonnegative point function that is integrated to compute total probability 
of a set, just as a mass density function is integrated over a region to 
compute the mass of a region in physics. Since in this example f(r) is 
constant over a region, it is called a uniform pdf. 

The formula (2.2) for computing probability has many implications, 
three of which merit comment at this point. 

• Probabilities are nonnegative: 

$$P(F) \ge 0 \text{ for any } F. \tag{2.7}$$

This follows since integrating a nonnegative argument yields a nonnegative 
result. 

• The probability of the entire sample space is 1: 

$$P(\Omega) = 1. \tag{2.8}$$

This follows since integrating 1 over the unit interval yields 1, but it has 
the intuitive interpretation that the probability that “something happens” 
is 1. 

• The probability of the union of disjoint regions is the sum of the proba- 
bilities of the individual events: 

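$$\text{If } F \cap G = \emptyset, \text{ then } P(F \cup G) = P(F) + P(G). \tag{2.9}$$

This follows immediately from the properties of integration:

$$\begin{aligned}
P(F \cup G) &= \int_{F \cup G} f(r)\,dr \\
&= \int_F f(r)\,dr + \int_G f(r)\,dr \\
&= P(F) + P(G).
\end{aligned}$$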

An alternative proof follows by observing that since $F$ and $G$ are disjoint,
$1_{F \cup G}(r) = 1_F(r) + 1_G(r)$ and hence linearity of integration implies that

$$\begin{aligned}
P(F \cup G) &= \int 1_{F \cup G}(r) f(r)\,dr \\
&= \int \left(1_F(r) + 1_G(r)\right) f(r)\,dr \\
&= \int 1_F(r) f(r)\,dr + \int 1_G(r) f(r)\,dr \\
&= P(F) + P(G).
\end{aligned}$$



This property is often called the additivity property of probability. The 
second proof makes it clear that additivity of probability is an immediate 
result of the linearity of integration, i.e., that the integral of the sum of two 
functions is the sum of the two integrals. 
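
The indicator identity that drives the second proof is easy to check by brute force. The following sketch does so on a small finite stand-in for the sample space; the sets `F`, `G`, and `H` and the helper `indicator` are made up for the illustration. It also shows why disjointness matters: for overlapping sets the sum of the indicators counts the common points twice.

```python
def indicator(points):
    """Indicator function of a finite set of points (a stand-in for a set F)."""
    members = frozenset(points)
    return lambda r: 1 if r in members else 0

grid = range(10)            # a small stand-in for the sample space
F, G = {0, 1, 2}, {5, 6}    # disjoint sets
H = {2, 3}                  # overlaps F at the point 2

one_F, one_G, one_H = indicator(F), indicator(G), indicator(H)
one_FuG, one_FuH = indicator(F | G), indicator(F | H)

# Disjoint: the indicator of the union is the sum of the indicators.
print(all(one_FuG(r) == one_F(r) + one_G(r) for r in grid))   # True
# Overlapping: the sum equals 2 at the common point, so equality fails.
print(all(one_FuH(r) == one_F(r) + one_H(r) for r in grid))   # False
```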

Repeated application of additivity for two events shows that for any
finite collection $\{F_k; k = 1, 2, \ldots, K\}$ of disjoint or mutually exclusive
events, i.e., events with the property that $F_k \cap F_j = \emptyset$ for all $k \ne j$, we
have that

$$P\left(\bigcup_{k=1}^{K} F_k\right) = \sum_{k=1}^{K} P(F_k), \tag{2.10}$$

showing that additivity implies finite additivity, the similar property
for finite collections of sets instead of just two sets. Since additivity
is a special case of finite additivity, the two notions are equivalent and
we can use them interchangeably.

These three properties of nonnegativity, normalization, and additivity 
are fundamental to the definition of the general notion of probability and 
will form three of the four axioms needed for a precise development. It 
is tempting to call an assignment P of numbers to subsets of a sample 
space a probability measure if it satisfies these three properties, but we 
shall see that a fourth condition, which is crucial for having well behaved 
limits and asymptotics, will be needed to complete the definition. Pending 
this fourth condition, (2.2) defines a probability measure. A sample space 
together with a probability measure provide a mathematical model for an 
experiment. This model is often called a probability space, but for the 
moment we shall stick to the less intimidating word of experiment. 



Simple Properties 

Several simple properties of probabilities can be derived from what we have 
so far. As particularly simple, but still important, examples, consider the
following.
Assume that $P$ is a set function defined on a sample space $\Omega$ that satisfies
properties (2.7)-(2.9). Then

(a) $P(F^c) = 1 - P(F)$.

(b) $P(F) \le 1$.

(c) Let $\emptyset$ be the null or empty set; then $P(\emptyset) = 0$.

(d) If $\{F_i; i = 1, 2, \ldots, K\}$ is a finite partition of $\Omega$, i.e., if
$F_i \cap F_k = \emptyset$ when $i \ne k$ and $\bigcup_{i=1}^{K} F_i = \Omega$, then

$$P(G) = \sum_{i=1}^{K} P(G \cap F_i) \tag{2.11}$$

for any event $G$.

Proof: 

(a) $F \cup F^c = \Omega$ implies $P(F \cup F^c) = 1$ (property 2.8). $F \cap F^c = \emptyset$ implies
$1 = P(F \cup F^c) = P(F) + P(F^c)$ (property 2.9), which implies (a).

(b) $P(F) = 1 - P(F^c) \le 1$ (property 2.7 and (a) above).

(c) By property 2.8 and (a) above, $P(\Omega^c) = P(\emptyset) = 1 - P(\Omega) = 0$.

(d) $P(G) = P(G \cap \Omega) = P(G \cap (\bigcup_i F_i)) = P(\bigcup_i (G \cap F_i)) = \sum_i P(G \cap F_i)$.

Observe that although the null or empty set $\emptyset$ has probability 0, the
converse is not true in that a set need not be empty just because it has
zero probability. In the uniform fair wheel example the set $F = \{1/n : n =
1, 2, 3, \ldots\}$ is not empty, but it does have probability zero. This follows
roughly because for any finite $N$, $P(\{1/n : n = 1, 2, 3, \ldots, N\}) = 0$ and
therefore the limit as $N \to \infty$ must also be zero.
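
One way to make the word "roughly" more precise, offered here as a sketch rather than as the argument the text has in mind, is to trap the set between zero and an arbitrarily small number. The finite set $A_N$ is notation introduced for this illustration. The sketch assumes that the set $F = \{1/n\}$ is one for which the probability is well defined, and it uses two consequences of nonnegativity and additivity: a subset cannot have larger probability than a set containing it, and the probability of a union is at most the sum of the probabilities.

```latex
\begin{align*}
A_N &= \{1/n : n = 1, 2, \ldots, N\}, \qquad P(A_N) = 0, \\
F &\subset A_N \cup [0, 1/N]
   && \text{since } 1/n \le 1/N \text{ for all } n \ge N, \\
0 \le P(F) &\le P(A_N) + P([0, 1/N]) = 0 + \frac{1}{N}
   && \text{for every } N = 2, 3, \ldots
\end{align*}
```

Since the bound $1/N$ can be made as small as desired, the only value consistent with all of these inequalities is $P(F) = 0$.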

A Single Coin Flip 

The original example of a spinning wheel is continuous in that the sample 
space consists of a continuum of possible outcomes, all points in the unit 
interval. Sample spaces can also be discrete, as is the case of modeling 
a single flip of a “fair” coin with heads labeled “1” and tails labeled “0”, 
i.e., heads and tails are equally likely. The sample space in this example is 
$\Omega = \{0, 1\}$ and the probability for any event or subset of $\Omega$ can be defined
in a reasonable way by

$$P(F) = \sum_{r \in F} p(r), \tag{2.12}$$

or, equivalently,

$$P(F) = \sum 1_F(r) p(r), \tag{2.13}$$

where now $p(r) = 1/2$ for each $r \in \Omega$. The function $p$ is called a
probability mass function or pmf because it is summed over points to find total
probability, just as point masses are summed to find total mass in physics. 
Be cautioned that P is defined for sets and p is defined only for points in 
the sample space. This can be confusing when dealing with one-point or 
singleton sets, for example 



$$\begin{aligned}
P(\{0\}) &= p(0) \\
P(\{1\}) &= p(1).
\end{aligned}$$



This may seem too much work for such a little example, but keep in mind 
that the goal is a formulation that will work for far more complicated and 
interesting examples. This example is different from the spinning wheel 
in that the sample space is discrete instead of continuous and that the 
probabilities of events are defined by sums instead of integrals, as one should 
expect when doing discrete math. It is easy to verify, however, that the 
basic properties (2.7)-(2.9) hold in this case as well (since sums behave like 
integrals), which in turn implies that the simple properties (a)-(d) also
hold. 
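
Because the coin flip has only four possible events, the claim that the basic properties hold can be verified exhaustively. The following Python sketch does this for the fair coin pmf; the names `omega`, `pmf`, `P`, and `all_events` are choices made for this example. Note how `P` takes a set while `pmf` is indexed by a point, mirroring the distinction between $P$ and $p$ drawn above.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = (0, 1)                                   # tails = 0, heads = 1
pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}     # fair coin, as in the text

def P(event):
    """Probability of an event (a set of sample points), as in (2.12)."""
    return sum((pmf[r] for r in event), Fraction(0))

def all_events(space):
    """Every subset of a finite sample space."""
    pts = list(space)
    subsets = chain.from_iterable(
        combinations(pts, k) for k in range(len(pts) + 1))
    return [frozenset(s) for s in subsets]

events = all_events(omega)
full = frozenset(omega)

assert all(P(F) >= 0 for F in events)                      # (2.7)
assert P(full) == 1                                        # (2.8)
assert all(P(F | G) == P(F) + P(G)                         # (2.9)
           for F in events for G in events if not (F & G))
assert all(P(full - F) == 1 - P(F) for F in events)        # property (a)
assert all(P(F) <= 1 for F in events)                      # property (b)
assert P(frozenset()) == 0                                 # property (c)
print(len(events), "events checked")
```

Changing the two values in `pmf` to any other nonnegative pair that sums to one gives a biased coin, and the same checks go through.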

A Single Coin Flip as Signal Processing 

The coin flip example can also be derived in a very different way that pro- 
vides our first example of signal processing. Consider again the spinning 
pointer so that the sample space is $\Omega$ and the probability measure $P$ is
described by (2.2) using a uniform pdf as in (2.4). Performing the experiment
by spinning the pointer will yield some real number $r \in [0, 1)$. Define a
measurement $q$ made on this outcome by

$$q(r) = \begin{cases} 1 & \text{if } r \in [0, 0.5] \\ 0 & \text{if } r \in (0.5, 1). \end{cases} \tag{2.14}$$

This function can also be defined somewhat more economically as

$$q(r) = 1_{[0, 0.5]}(r). \tag{2.15}$$

This is an example of a quantizer, an operation that maps a continuous 
value into a discrete one. Quantization is an example of signal processing 
since it is a function or mapping defined on an input space, here $\Omega = [0, 1)$
or $\Omega = \Re$, producing a value in some output space, here a binary space
$\Omega_g = \{0, 1\}$. The dependence of a function on its input space or domain
of definition $\Omega$ and its output space or range $\Omega_g$ is often denoted by
$q : \Omega \to \Omega_g$. Although introduced as an example of simple signal processing,
the usual name for a real-valued function defined on the sample space of 
a probability space is a random variable. We shall see in the next chapter 
that there is an extra technical condition on functions to merit this name, 
but that is a detail that can be postponed. 

The output space $\Omega_g$ can be considered as a new sample space, the space
corresponding to the possible values seen by an observer of the output of the
quantizer (an observer who might not have access to the original space). If
we know both the probability measure on the input space and the function,
then in theory we should be able to describe the probability measure that
the output space inherits from the input space. Since the output space is
discrete, it should be described by a pmf, say $p_q$. Since there are only two
points, we need only find the value of $p_q(1)$ (or $p_q(0)$ since $p_q(0) + p_q(1) = 1$).
An output of 1 is seen if and only if the input sample point lies in $[0, 0.5]$,
so it follows easily that $p_q(1) = P([0, 0.5]) = \int_0^{0.5} f(r)\,dr = 0.5$, exactly the
value assumed for the fair coin flip model. The pmf $p_q$ implies a probability
measure on the output space $\Omega_g$ by

$$P_q(F) = \sum_{r \in F} p_q(r),$$

where the subscript $q$ distinguishes the probability measure $P_q$ on the
output space from the probability measure $P$ on the input space. Note that
we can define any other binary quantizer corresponding to an “unfair” or
biased coin by changing the 0.5 to some other value.
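
The derived pmf can be written down directly and compared with a simulation. In the sketch below the threshold of the quantizer is a parameter, so that 0.5 gives the fair coin and other values give biased coins. The function names are hypothetical, and Python's built-in `random` module stands in for the spinning pointer; it is seeded only so that repeated runs give the same numbers.

```python
import random

def quantize(r, threshold=0.5):
    """Binary quantizer in the style of (2.14): 1 if r is in [0, threshold]."""
    return 1 if 0.0 <= r <= threshold else 0

def derived_pmf(threshold=0.5):
    """pmf inherited from the uniform pointer: the inputs mapped to 1 form
    the interval [0, threshold], whose probability is its length."""
    return {1: threshold, 0: 1.0 - threshold}

def empirical_pmf(threshold, n, seed=0):
    """Estimate the output pmf from n simulated spins of the pointer."""
    rng = random.Random(seed)
    ones = sum(quantize(rng.random(), threshold) for _ in range(n))
    return {1: ones / n, 0: 1.0 - ones / n}

print(derived_pmf(0.5))               # the fair coin
print(derived_pmf(0.3))               # a biased coin
print(empirical_pmf(0.3, 100_000))    # should be close to the line above
```

The simulated values will not match the derived pmf exactly; they are relative frequencies from a finite number of trials.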

This simple example makes several fundamental points that will evolve 
in depth in the course of this material. First, it provides an example of 
signal processing and the first example of a random variable, which is essen- 
tially just a mapping of one sample space into another. Second, it provides 
an example of a derived distribution: given a probability space described 
by $\Omega$ and $P$ and a function (random variable) $q$ defined on this space, we
have derived a new probability space describing the outputs of the function
with sample space $\Omega_g$ and probability measure $P_q$. Third, it is an example
of a common phenomenon that quite different models can result in
identical sample spaces and probability measures. Here the coin flip could be
modeled in a directly given fashion by just describing the sample space 
and the probability measure, or it can be modeled in an indirect fashion 
as a function (signal processing, random variable) on another experiment. 
This suggests, for example, that to study coin flips empirically we could 
either actually flip a fair coin, or we could spin a fair wheel and quantize 
the output. Although the second method seems more complicated, it is in 
fact extremely common since most random number generators (or pseudo- 
random number generators) strive to produce random numbers with a uni- 
form distribution on [0, 1) and all other probability measures are produced 
by further signal processing. We have seen how to do this for a simple coin 
flip. In fact any pdf or pmf can be generated in this way. (See problem 3.7.) 
The generation of uniform random numbers is both a science and an art.
Most generators function roughly as follows. One begins with a floating point
number in (0, 1) called the seed, say $a$, and uses another positive floating
point number, say $b$, as a multiplier. A sequence is then generated recursively
as $x_0 = a$ and $x_n = b \times x_{n-1} \bmod 1$ for $n = 1, 2, \ldots$, that is, the fractional
part of $b \times x_{n-1}$. If the two numbers $a$ and $b$ are suitably chosen then
$x_n$ should appear to be uniform. (Try it!) In fact, since there are only
a finite number (albeit large) of possible numbers that can be represented
on a digital computer, this algorithm must eventually repeat and hence $x_n$
must be a periodic sequence. The goal of designing a good pseudo-random
number generator is to make the period as long as possible and to make
the sequences produced look as much as possible like a random sequence in
the sense that statistical tests for independence are fooled.
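
For readers who want to try it, here is a minimal Python sketch of the recursion just described, followed by the quantizer of (2.14) to turn the output into coin flips. The seed 0.123456789 and the multiplier 997 are arbitrary values picked for illustration, not recommended choices, and the function name is made up. This is a toy: it makes no claim to pass serious statistical tests.

```python
def fractional_generator(seed, multiplier, count):
    """Return [x_0, ..., x_{count-1}] with x_0 = seed and
    x_n = (multiplier * x_{n-1}) mod 1, computed in floating point."""
    if not (0.0 < seed < 1.0) or multiplier <= 0.0:
        raise ValueError("need 0 < seed < 1 and multiplier > 0")
    x = seed
    values = []
    for _ in range(count):
        values.append(x)
        x = (multiplier * x) % 1.0   # fractional part of multiplier * x
    return values

# Hypothetical choices of a and b, picked only for illustration.
xs = fractional_generator(seed=0.123456789, multiplier=997.0, count=10_000)
flips = [1 if x <= 0.5 else 0 for x in xs]     # quantize as in (2.14)

print(min(xs) >= 0.0 and max(xs) < 1.0)   # every value lies in [0, 1)
print(sum(xs) / len(xs))                  # sample mean; near 0.5 if roughly uniform
print(sum(flips) / len(flips))            # fraction of heads
```

Poor choices are easy to find. A multiplier that is a power of two, for example, shifts bits out of the floating point representation at every step and the sequence soon collapses to zero, which is one reason the text calls the design of such generators an art.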

Abstract vs. Concrete 

It may seem strange that the axioms of probability deal with apparently 
abstract ideas of measures instead of corresponding physical intuition that 
the probability tells you something about the fraction of times specific 
events will occur in a sequence of trials, such as the relative frequency of 
a pair of dice summing to seven in a sequence of many rolls, or a decision
algorithm correctly detecting a single binary symbol in the presence of noise
in a transmitted data file. Such real world behavior can be quantified by
the idea of a relative frequency. Suppose that the output of the $n$th of a
sequence of trials is $x_n$ and we wish to know the relative frequency that $x_n$
takes on a particular value, say $a$. Then given an infinite sequence of trials
$x = \{x_0, x_1, x_2, \ldots\}$ we could define the relative frequency of $a$ in $x$ by

$$r_a(x) = \lim_{n \to \infty} \frac{\text{number of } k \in \{0, 1, \ldots, n-1\} \text{ for which } x_k = a}{n}. \tag{2.16}$$
For example, the relative frequency of heads in an infinite sequence of fair
coin flips should be 0.5, the relative frequency of rolling a pair of fair dice 
and having the sum be 7 in an infinite sequence of rolls should be 1/6 since 
the pairs (1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3) are equally likely and form 
6 of the possible 36 pairs of outcomes. Thus one might suspect that to 
make a rigorous theory of probability requires only a rigorous definition 
of probabilities as such limits and a reaping of the resulting benefits. In 
fact much of the history of theoretical probability consisted of attempts to 
accomplish this, but unfortunately it does not work. Such limits might not 
exist, or they might exist and not converge to the same thing for different 
repetitions of the same experiment. Even when the limits do exist there 
is no guarantee they will behave as intuition would suggest when one tries 
to do calculus with probabilities, to compute probabilities of complicated 
events from those of simple related events. Attempts to get around these 
problems uniformly failed and probability was not put on a rigorous basis 
until the axiomatic approach was completed by Kolmogorov. The axioms 
do, however, capture certain intuitive aspects of relative frequencies. Rel- 
ative frequencies are nonnegative, the relative frequency of the entire set 
of possible outcomes is one, and relative frequencies are additive in the 
sense that the relative frequency of the symbol $a$ or the symbol $b$ occurring,
$r_{a \cup b}(x)$, is clearly $r_a(x) + r_b(x)$. Kolmogorov realized that beginning with
simple axioms could lead to rigorous limiting results of the type needed, 
while there was no way to begin with the limiting results as part of the 
axioms. In fact it is the fourth axiom, a limiting version of additivity, that 
plays the key role in making the asymptotics work. 
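
For a finite number of trials the ratio inside the limit of (2.16) is easy to compute, and the three intuitive properties mentioned above can be seen directly. The sketch below uses a made-up list of eight outcomes; the function names are choices for this example. Nothing here addresses the limit itself, which is exactly the part that resisted a rigorous treatment.

```python
def relative_frequency(xs, a):
    """Fraction of the n = len(xs) trials whose outcome equals a, i.e. the
    quantity inside the limit of (2.16) for one finite n."""
    if not xs:
        raise ValueError("need at least one trial")
    return sum(1 for x in xs if x == a) / len(xs)

def relative_frequency_of_set(xs, symbols):
    """Fraction of trials whose outcome falls in a set of symbols."""
    if not xs:
        raise ValueError("need at least one trial")
    return sum(1 for x in xs if x in symbols) / len(xs)

rolls = [1, 6, 3, 3, 2, 6, 5, 3]           # made-up outcomes of eight trials
r3 = relative_frequency(rolls, 3)           # 3 of the 8 trials
r6 = relative_frequency(rolls, 6)           # 2 of the 8 trials
print(r3, r6)                               # 0.375 0.25
# additivity for two distinct symbols
print(relative_frequency_of_set(rolls, {3, 6}) == r3 + r6)    # True
# the relative frequency of the whole set of outcomes is one
print(relative_frequency_of_set(rolls, {1, 2, 3, 4, 5, 6}))   # 1.0
```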



2.3 Probability Spaces 

We now turn to a more thorough development of the ideas introduced in 
the previous section. 

A sample space $\Omega$ is an abstract space, a nonempty collection of points
or members or elements called sample points (or elementary events or ele- 
mentary outcomes). 

An event space (or sigma-field or sigma-algebra) $\mathcal{F}$ of a sample space
$\Omega$ is a nonempty collection of subsets of $\Omega$ called events with the following
properties:

$$\text{If } F \in \mathcal{F}, \text{ then also } F^c \in \mathcal{F}, \tag{2.17}$$

that is, if a given set is an event, then its complement must also be an
event. Note that any particular subset of $\Omega$ may or may not be an event
(review the quantizer example).

If for some finite $n$, $F_i \in \mathcal{F}$, $i = 1, 2, \ldots, n$, then also

$$\bigcup_{i=1}^{n} F_i \in \mathcal{F}, \tag{2.18}$$

that is, a finite union of events must also be an event.



If $F_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then also

$$\bigcup_{i=1}^{\infty} F_i \in \mathcal{F}, \tag{2.19}$$

that is, a countable union of events must also be an event.

We shall later see alternative ways of describing (2.19), but this form is 
the most common. 

Eq. (2.18) can be considered as a special case of (2.19) since, given a
finite collection $F_k$, $k = 1, 2, \ldots, N$, we can construct an infinite
sequence $G_n$ with the same union by choosing $G_n = F_n$ for
$n = 1, 2, \ldots, N$ and $G_n = \emptyset$ otherwise. It is convenient, however, to
consider the finite case separately. If a collection of sets satisfies only (2.17)
and (2.18) but not (2.19), then it is called a field or algebra of sets. For this
reason, in elementary probability theory one often refers to “set algebra”
or to the “algebra of events.” (Don’t worry about why (2.19) might not be
satisfied.) Both (2.17) and (2.18) can be considered as “closure” properties;
that is, an event space must be closed under complementation and unions
in the sense that performing a sequence of complementations or unions of
events must yield a set that is also in the collection, i.e., a set that is also
an event. Observe also that (2.17), (2.18), and (A.11) imply that

$$\Omega \in \mathcal{F}, \tag{2.20}$$

that is, the whole sample space considered as a set must be in $\mathcal{F}$; that is,
it must be an event. Intuitively, $\Omega$ is the “certain event,” the event that
“something happens.” Similarly, (2.20) and (2.17) imply that

$$\emptyset \in \mathcal{F}, \tag{2.21}$$

and hence the empty set must be in $\mathcal{F}$, corresponding to the intuitive event
“nothing happens.”
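
On a finite sample space the closure properties can be tested mechanically. The sketch below checks (2.17) and (2.18) for a candidate collection of subsets; the function name `is_field` and the three-point sample space are assumptions made for the example. Checking unions of pairs is enough, since closure under pairwise unions gives closure under all finite unions by repetition.

```python
from itertools import combinations

def is_field(omega, collection):
    """Check (2.17) and (2.18) for a collection of subsets of a finite omega.
    On a finite space every union is a finite union, so (2.19) adds nothing."""
    omega = frozenset(omega)
    events = {frozenset(e) for e in collection}
    if not events or any(not e <= omega for e in events):
        return False
    closed_under_complement = all(omega - e in events for e in events)
    closed_under_union = all(e | f in events
                             for e, f in combinations(events, 2))
    return closed_under_complement and closed_under_union

omega = {0, 1, 2}
print(is_field(omega, [set(), {0}, {1, 2}, omega]))   # True
print(is_field(omega, [set(), {0}, {1}, omega]))      # False: {1, 2} is missing
print(is_field(omega, [set(), omega]))                # True: the trivial event space
```

In the first collection the one-point set {1} is a subset of the sample space but is not an event, which illustrates the remark that a particular subset may or may not be an event.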
A few words about the different nature of membership in $\Omega$ and $\mathcal{F}$ are in
order. If the set $F$ is a subset of $\Omega$, then we write $F \subset \Omega$. If the subset $F$
is also in the event space, then we write $F \in \mathcal{F}$. Thus we use set inclusion
when considering $F$ as a subset of an abstract space, and element inclusion
when considering $F$ as a member of the event space and hence as an event.
Alternatively, the elements of $\Omega$ are points, and a collection of these points
is a subset of $\Omega$; but the elements of $\mathcal{F}$ are sets (subsets of $\Omega$) and not
points. A student should ponder the different natures of abstract spaces of
points and event spaces consisting of sets until the reasons for set inclusion
in the former and element inclusion in the latter space are clear. Consider
especially the difference between an element of $\Omega$ and a subset of $\Omega$ that
consists of a single point. The latter might or might not be an element of $\mathcal{F}$;
the former is never an element of $\mathcal{F}$. Although the difference might seem to
be merely semantics, the difference is important and should be thoroughly
understood.

A measurable space $(\Omega, \mathcal{F})$ is a pair consisting of a sample
space $\Omega$ and an event space or sigma-field $\mathcal{F}$ of subsets
of $\Omega$. The strange name “measurable space” reflects the fact that we
can assign a measure, such as a probability measure, to such a space and
thereby form a probability space or probability measure space.

A probability measure $P$ on a measurable space $(\Omega, \mathcal{F})$ is
an assignment of a real number $P(F)$ to every member $F$ of the
sigma-field (that is, to every event) such that $P$ obeys the following
rules, which we refer to as the axioms of probability.

Axiom 2.1

$$P(F) \ge 0 \text{ for all } F \in \mathcal{F}, \tag{2.22}$$

i.e., no event has negative probability.

Axiom 2.2

$$P(\Omega) = 1, \tag{2.23}$$

i.e., the probability of “everything” is one.

Axiom 2.3 If $F_i$, $i = 1, 2, \ldots, n$ are disjoint, then

$$P\left(\bigcup_{i=1}^{n} F_i\right) = \sum_{i=1}^{n} P(F_i). \tag{2.24}$$







Axiom 2.4 If $F_i$, $i = 1, 2, \ldots$ are disjoint, then

$$P\left(\bigcup_{i=1}^{\infty} F_i\right) = \sum_{i=1}^{\infty} P(F_i). \tag{2.25}$$

Note that nothing has been said to the effect that probabilities must be 
sums or integrals, but the first three axioms should be recognizable from 
the three basic properties of nonnegativity, normalization, and additivity 
encountered in the simple examples introduced in the introduction to this 
chapter where probabilities were defined by an integral over a set of a pdf 
or a sum over a set of a pmf. The axioms capture these properties in a gen- 
eral form and will be seen to include more general constructions, including 
multidimensional integrals and combinations of integrals and sums. The 
fourth axiom can be viewed as an extra technical condition that must be 
included in order to get various limits to behave. Just as property (2.19) of 
an event space will later be seen to have an alternative statement in terms 
of limits of sets, the fourth axiom of probability, axiom 2.4, will be shown 
to have an alternative form in terms of explicit limits, a form providing an 
important continuity property of probability. Also as in the event space 
properties, the fourth axiom implies the third. 

As with the defining properties of an event space, for the purposes of dis- 
cussion we have listed separately the finite special case (2.24) of the general 
condition (2.25). The finite special case is all that is required for elemen- 
tary discrete probability. The general condition is required to get a useful 
theory for continuous probability. A good way to think of these conditions 
is that they essentially describe probability measures as set functions de- 
fined by either summing or integrating over sets, or by some combination 
thereof. Hence much of probability theory is simply calculus, especially the 
evaluation of sums and integrals. 

To emphasize an important point: a function P which assigns numbers 
to elements of an event space of a sample space is a probability measure if 
and only if it satisfies all of the four axioms! 

A probability space or experiment is a triple $(\Omega, \mathcal{F}, P)$
consisting of a sample space $\Omega$, an event space $\mathcal{F}$ of
subsets of $\Omega$, and a probability measure $P$ defined for all members
of $\mathcal{F}$.

Before developing each idea in more detail and providing several examples
of each piece of a probability space, we pause to consider two simple
examples of the complete construction. The first example is the simplest
possible probability space and is commonly referred to as the trivial
probability space. Although useless for application, the model does serve
a purpose by showing that a well-defined model need not be interesting.
The second example is essentially the simplest nontrivial probability
space, a slight generalization of the fair coin flip permitting an unfair
coin.

[2.0] Let $\Omega$ be any abstract space and let
$\mathcal{F} = \{\Omega, \emptyset\}$; that is, $\mathcal{F}$ consists of
exactly two sets: the sample space (everything) and the empty set
(nothing). This is called the trivial event space. This is a model of an
experiment where only two events are possible: “Something happens” or
“nothing happens,” which is not a very interesting description. There is
only one possible probability measure for this measurable space:
$P(\Omega) = 1$ and $P(\emptyset) = 0$. (Why?) This probability measure
meets the required rules that define a probability measure; they can be
directly verified since there are only two possible events. Equations
(2.22) and (2.23) are obvious. Equations (2.24) and (2.25) follow since
the only possible values for $F_i$ are $\Omega$ and $\emptyset$. Because
the $F_i$ are disjoint, at most one of them can be $\Omega$. If one of the
$F_i$ is indeed $\Omega$, then both sides of the equality are 1.
Otherwise, both sides are 0.



[2.1] Let $\Omega = \{0, 1\}$. Let
$\mathcal{F} = \{\{0\}, \{1\}, \Omega = \{0, 1\}, \emptyset\}$. Since
$\mathcal{F}$ contains all of the subsets of $\Omega$, the properties
(2.17) through (2.19) are trivially satisfied, and hence it is an event
space. (There is one other possible event space that could be defined for
$\Omega$ in this example. What is it?) Define the set function $P$ by

$$P(F) = \begin{cases}
1 - p & \text{if } F = \{0\} \\
p & \text{if } F = \{1\} \\
0 & \text{if } F = \emptyset \\
1 & \text{if } F = \Omega,
\end{cases}$$

where $p \in (0, 1)$ is a fixed parameter. (If $p = 0$ or $p = 1$ the space
becomes trivial.) It is easily verified that $P$ satisfies the axioms of
probability and hence is a probability measure. Therefore
$(\Omega, \mathcal{F}, P)$ is a probability space. Note that we had to give
the value of $P(F)$ for all events $F$, a construction that would clearly
be absurd for large sample spaces. Note also that the choice of $P(F)$ is
not unique for the given measurable space $(\Omega, \mathcal{F})$; we could
have chosen any value in $[0, 1]$ for $P(\{1\})$ and used the axioms to
complete the definition.
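
The verification claimed in example [2.1] can be sketched numerically. The
following is a minimal illustration, not part of the original development:
the function name `is_probability_measure` and the bias value p = 1/3 are
assumptions chosen for the example. It checks nonnegativity, normalization,
and additivity over pairs of disjoint events; for a finite event space that
is closed under unions, pairwise additivity extends to (2.24) by induction,
and (2.25) adds nothing new because only finitely many distinct events
exist.

```python
from fractions import Fraction
from itertools import combinations


def is_probability_measure(omega, events, P):
    """Check axioms 2.1-2.3 on a finite event space.

    omega  : frozenset, the sample space
    events : iterable of frozensets, assumed closed under unions
    P      : dict mapping every event to its assigned number
    """
    events = set(events)
    # Axiom 2.1: no event has negative probability.
    if any(P[F] < 0 for F in events):
        return False
    # Axiom 2.2: the whole sample space has probability one.
    if P[omega] != 1:
        return False
    # Axiom 2.3, checked on every pair of disjoint events.
    for F, G in combinations(events, 2):
        if not (F & G) and P[F | G] != P[F] + P[G]:
            return False
    return True


p = Fraction(1, 3)  # hypothetical bias, any value in (0, 1) would do
omega = frozenset({0, 1})
empty = frozenset()
zero, one = frozenset({0}), frozenset({1})
events = [zero, one, omega, empty]
P = {zero: 1 - p, one: p, empty: Fraction(0), omega: Fraction(1)}

print(is_probability_measure(omega, events, P))  # True
```

Exact rational arithmetic (`Fraction`) is used so that the additivity test
is not disturbed by floating-point rounding.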



The preceding example is the simplest nontrivial example of a probabil- 
ity space and provides a rigorous mathematical model for applications such 
as the binary transmission of a single bit or for the flipping of a single bi- 
ased coin once. It therefore provides a complete and rigorous mathematical 
model for the single coin flip of the introduction. 

We now develop in more detail properties and examples of the three 
components of probability spaces: sample spaces, event spaces, and proba- 
bility measures. 

2.3.1 Sample Spaces

Intuitively, a sample space is a listing of all conceivable finest-grain, distin- 
guishable outcomes of an experiment to be modeled by a probability space. 
Mathematically it is just an abstract space. 

Examples 

[2.2] A finite space $\Omega = \{a_k; k = 1, 2, \ldots, K\}$. Specific
examples are the binary space $\{0, 1\}$ and the finite space of integers
$\mathcal{Z}_k = \{0, 1, 2, \ldots, k-1\}$.

[2.3] A countably infinite space $\Omega = \{a_k; k = 0, 1, 2, \ldots\}$,
for some sequence $\{a_k\}$. Specific examples are the space of all
nonnegative integers $\{0, 1, 2, \ldots\}$, which we denote by
$\mathcal{Z}_+$, and the space of all integers
$\{\ldots, -2, -1, 0, 1, 2, \ldots\}$, which we denote by $\mathcal{Z}$.
Other examples are the space of all rational numbers, the space of all
even integers, and the space of all periodic sequences of integers.

Both examples [2.2] and [2.3] are called discrete spaces. Spaces with 
finite or countably infinite numbers of elements are called discrete spaces. 

[2.4] An interval of the real line $\Re$, for example, $\Omega = (a, b)$.
We might consider an open interval $(a, b)$, a closed interval $[a, b]$, a
half-open interval $[a, b)$ or $(a, b]$, or even the entire real line $\Re$
itself. (See appendix A for details on these different types of
intervals.)

Spaces such as example [2.4] that are not discrete are said to be
continuous. In some cases it is more accurate to think of spaces as being
a mixture of discrete and continuous parts, e.g., the space
$\Omega = (1, 2) \cup \{4\}$ consisting of a continuous interval and an
isolated point. Such spaces can usually be handled by treating the
discrete and continuous components separately.

[2.5] A space consisting of $k$-dimensional vectors with coordinates
taking values in one of the previously described spaces. A useful notation
for such vector spaces is a product space. Let $A$ denote one of the
abstract spaces previously considered. Define the Cartesian product $A^k$
by

$$A^k = \{\text{all vectors } \mathbf{a} = (a_0, a_1, \ldots, a_{k-1}) \text{ with } a_i \in A\}.$$

Thus, for example, $\Re^k$ is $k$-dimensional Euclidean space. $\{0, 1\}^k$
is the space of all binary $k$-tuples, that is, the space of all
$k$-dimensional binary vectors. As particular examples,
$\{0, 1\}^2 = \{00, 01, 10, 11\}$ and
$\{0, 1\}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}$. $[0, 1]^2$ is
the unit square in the plane. $[0, 1]^3$ is the unit cube in
three-dimensional Euclidean space.

Alternative notations for a Cartesian product space are

$$\prod_{i \in \mathcal{Z}_k} A_i = \prod_{i=0}^{k-1} A_i,$$

where again the $A_i$ are all replicas or copies of $A$, that is, where
$A_i = A$, all $i$. Other notations for such a finite-dimensional Cartesian
product are

$$\times_{i \in \mathcal{Z}_k} A_i = \times_{i=0}^{k-1} A_i = A^k.$$

This and other product spaces will prove to be a useful means of describ- 
ing abstract spaces modeling sequences of elements from another abstract 
space. 

Observe that a finite-dimensional vector space constructed from a dis- 
crete space is also discrete since if one can count the number of possible 
values one coordinate can assume, then one can count the number of pos- 
sible values that a finite number of coordinates can assume. 
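
The counting argument can be made concrete for a finite coordinate space.
The sketch below is illustrative only; the helper name `cartesian_power` is
an assumption, not notation from the text. It lists every vector of the
product space and confirms that a coordinate space with two elements yields
$2^k$ vectors.

```python
from itertools import product


def cartesian_power(A, k):
    """Return all k-dimensional vectors with coordinates in the finite set A."""
    return list(product(sorted(A), repeat=k))


binary_pairs = cartesian_power({0, 1}, 2)
# Print the vectors in the compact string form used in the text.
print(["".join(map(str, v)) for v in binary_pairs])  # ['00', '01', '10', '11']
print(len(cartesian_power({0, 1}, 3)))               # 8
```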

[2.6] A space consisting of infinite sequences drawn from one of the
examples [2.2] through [2.4]. Points in this space are often called
discrete time signals. This is also a product space. Let $A$ be a sample
space and let $A_i$ be replicas or copies of $A$. We will consider both
one-sided and two-sided infinite products to model sequences with and
without a finite origin, respectively. Define the two-sided space

$$\prod_{i \in \mathcal{Z}} A_i = \{\text{all sequences } \{a_i; i = \ldots, -1, 0, 1, \ldots\}; a_i \in A_i\},$$

and the one-sided space

$$\prod_{i \in \mathcal{Z}_+} A_i = \{\text{all sequences } \{a_i; i = 0, 1, \ldots\}; a_i \in A_i\}.$$

These two spaces are also denoted by $\prod_{i=-\infty}^{\infty} A_i$ or
$\times_{i=-\infty}^{\infty} A_i$ and $\prod_{i=0}^{\infty} A_i$ or
$\times_{i=0}^{\infty} A_i$, respectively.

The two spaces under discussion are often called sequence spaces. Even if
the original space $A$ is discrete, the sequence space constructed from
$A$ will be continuous. For example, suppose that
$A_i = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ for all integers $i$. Then
$\times_{i=0}^{\infty} A_i$ is the space of all semi-infinite (one-sided)
decimal sequences, which is the same as the space of all real numbers in
the unit interval $[0, 1)$. This follows since if $\omega \in \Omega$, then
$\omega = (\omega_0, \omega_1, \omega_2, \ldots)$, which can be written as
$.\omega_0 \omega_1 \omega_2 \cdots$, which can represent any real number
in the unit interval by the decimal expansion
$\sum_{i=0}^{\infty} \omega_i 10^{-i-1}$. This space contains the decimal
representations of all of the real numbers in the unit interval, an
uncountable infinity of numbers. Similarly, there is an uncountable
infinity of one-sided binary sequences since one can express all points in
the unit interval in the binary number system as sequences to the right of
the “decimal” point (problem A.11).
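
As a small illustration of the correspondence between sequences and points
of the unit interval, the sketch below evaluates the expansion for a finite
prefix of a sequence. The function name `sequence_to_real` is an assumption
made for this example, and a finite prefix only approximates the value of
an infinite sequence.

```python
from fractions import Fraction


def sequence_to_real(digits, base=10):
    """Map a finite prefix (w0, w1, ...) to the sum of w_i * base**-(i+1)."""
    return sum(Fraction(w, base ** (i + 1)) for i, w in enumerate(digits))


print(sequence_to_real([2, 5]))              # 1/4, i.e. the decimal .25
print(sequence_to_real([1, 0, 1], base=2))   # 5/8, i.e. the binary .101
```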

[2.7] Let $A$ be one of the sample spaces of examples [2.2] through [2.4].
Form a new abstract space consisting of all waveforms or functions of time
with values in $A$, for example, all real-valued time functions or
continuous time signals. This space is also modeled as a product space.
For example, the infinite two-sided space for a given $A$ is

$$\prod_{t \in \Re} A_t = \{\text{all waveforms } \{x(t); t \in (-\infty, \infty)\}; x(t) \in A, \text{ all } t\},$$

with a similar definition for one-sided spaces and for time functions on a
finite time interval.

Note that we indexed sequences (discrete time signals) using subscripts,
as in $x_n$, and we indexed waveforms (continuous time signals) using
parentheses, as in $x(t)$. In fact, the notations are interchangeable; we
could denote waveforms as $\{x(t); t \in \Re\}$ or as
$\{x_t; t \in \Re\}$. The notation using
subscripts for sequences and parentheses for waveforms is the most com- 
mon, and we will usually stick to it. Yet another notation for discrete time 
signals is x[n], a common notation in the digital signal processing literature. 
It is worth remembering that vectors, sequences, and waveforms are all just 
indexed collections of numbers; the only difference is the index set: finite 
for vectors, countably infinite for sequences, and continuous for waveforms. 

★General Product Spaces 

All of the product spaces we have described can be viewed as special cases 
of the general product space defined next. 

Let $\mathcal{I}$ be an index set such as a finite set of integers
$\mathcal{Z}_k$, the set of all integers $\mathcal{Z}$, the set of all
nonnegative integers $\mathcal{Z}_+$, the real line $\Re$, or the
nonnegative reals $[0, \infty)$. Given a family of spaces
$\{A_t; t \in \mathcal{I}\}$, define the product space

$$\prod_{t \in \mathcal{I}} A_t = \{\text{all } \{a_t; t \in \mathcal{I}\}; a_t \in A_t, \text{ all } t\}.$$

The notation $\times_{t \in \mathcal{I}} A_t$ is also used for the same
thing. Thus product spaces model spaces of vectors, sequences, and
waveforms whose coordinate values are drawn from some fixed space. This
leads to two notations for the space of all $k$-dimensional vectors with
coordinates in $A$: $A^k$ and $A^{\mathcal{Z}_k}$. The shorter and simpler
notation is usually more convenient.

2.3.2 Event Spaces 

Intuitively, an event space is a collection of subsets of the sample space or 
groupings of elementary events which we shall consider as physical events 
and to which we wish to assign probabilities. Mathematically, an event 
space is a collection of subsets that is closed under certain set-theoretic 
operations; that is, performing certain operations on events or members of 
the event space must give other events. Thus, for example, if in the
single voltage measurement example we have $\Omega = \Re$ and we are told
that the set of all voltages greater than 5 volts,
$\{\omega : \omega > 5\}$, is an event, that is, is a member of a
sigma-field $\mathcal{F}$ of subsets of $\Re$, then necessarily its
complement $\{\omega : \omega \le 5\}$ must also be an event, that is, a
member of the sigma-field $\mathcal{F}$. If the latter set is not in
$\mathcal{F}$ then $\mathcal{F}$ cannot be an event space! Observe that no
problem arises if the complement physically cannot happen: events that
“cannot occur” can be included in $\mathcal{F}$ and then assigned
probability zero when choosing the probability measure $P$. For example,
even if you know that the voltage does not exceed 5 volts, if you have
chosen the real line $\Re$ as your sample space, then you must include the
set $\{r : r > 5\}$ in the event space if the set $\{r : r \le 5\}$ is an
event. The impossibility of a voltage greater than 5 is then expressed by
assigning $P(\{r : r > 5\}) = 0$.

While the definition of a sigma-field requires only that the class be
closed under complementation and countable unions, these requirements
immediately yield additional closure properties. The countably infinite
version of DeMorgan’s “laws” of elementary set theory requires that if
$F_i$, $i = 1, 2, \ldots$ are all members of a sigma-field, then so is

$$\bigcap_{i=1}^{\infty} F_i = \left(\bigcup_{i=1}^{\infty} F_i^c\right)^c.$$

It follows by similar set-theoretic arguments that any countable sequence
of any of the set-theoretic operations (union, intersection,
complementation, difference, symmetric difference) performed on events
must yield other events. Observe, however, that there is no guarantee that
uncountable operations on events will produce new events; they may or may
not. For example, if we are told that $\{F_r; r \in [0, 1]\}$ is a family
of events, then it is not necessarily true that
$\bigcup_{r \in [0, 1]} F_r$ is an event (see problem 2.2 for an example).

The requirement that a finite sequence of set-theoretic operations on
events yields other events is an intuitive necessity and is easy to verify
for a given collection of subsets of an abstract space: It is intuitively
necessary that logical combinations (and, or, and not) of events
corresponding to physical phenomena should also be events to which a
probability can
be assigned. If you know the probability of a voltage being greater than 
zero and you know the probability that the voltage is not greater than 5 
volts, then you should also be able to determine the probability that the 
voltage is greater than zero but not greater than 5 volts. It is easy to verify 
that finite sequences of set-theoretic combinations yield events because the 
finiteness of elementary set theory usually yields simple proofs. 
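
As a hedged sketch of how the voltage calculation mentioned above might go,
write $A = \{r : r > 0\}$ and $B = \{r : r \le 5\}$ (these labels are
introduced here only for illustration) with sample space $\Re$. Since
$B^c = \{r : r > 5\}$ is contained in $A$, the event $A$ splits into the
disjoint pieces $A \cap B$ and $B^c$, and the axioms (2.23) and (2.24)
give:

```latex
\begin{align*}
P(B) + P(B^c) &= P(\Omega) = 1
  && \text{($B$ and $B^c$ are disjoint with union $\Omega$)} \\
P(A) &= P(A \cap B) + P(B^c)
  && \text{($A \cap B$ and $B^c$ are disjoint with union $A$)} \\
P(\{r : 0 < r \le 5\}) = P(A \cap B) &= P(A) + P(B) - 1 .
\end{align*}
```

The argument uses only finite additivity, which is why it already works in
a field and needs no limiting operations.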

A natural question arises in regard to (2.17) and (2.18): Why not try 
to construct a useful probability theory on the more general notion of a 
field rather than a sigma-field? The response is that it unfortunately does 
not work. Probability theory requires many results involving limits, and 
such asymptotic results require the infinite relations of (2.19) and (2.25) to 
work. In some special cases, such as single coin flipping or single die rolling, 
the simpler finite results suffice because there are only a finite number of 
possible outcomes, and hence limiting results become trivial — any finite 
field is automatically a sigma-field. If, however, one flips a coin forever, 
then there is an uncountable infinity of possible outcomes, and the
asymptotic relations become necessary. Let $\Omega$ be the space of all
one-sided binary
sequences. Suppose that you consider the smallest field formed by all finite 
set-theoretic operations on the individual one-sided binary sequences, that 
is, on singleton sets in the sequence space. Then many countably infinite 
sets of binary sequences (say the set of all periodic sequences) are not events 
since they cannot be expressed as finite sequences of set-theoretic opera- 
tions on the singleton sets. Obviously, the sigma-field formed by including 
countable set-theoretic operations does not have this defect. This is why 
sigma-fields must be used rather than fields. 



Limits of Sets 

The condition (2.19) can be related to a condition on limits by defining
the notion of a limit of a sequence of sets. This notion will prove useful
when interpreting the axioms of probability. Consider a sequence of nested
sets $F_n$, $n = 1, 2, \ldots$, sets with the property that each set
contains its predecessor, that is, that $F_{n-1} \subset F_n$ for all $n$.
Such a sequence of sets is said to be increasing. For example, the
sequence $F_n = [1, 2 - 1/n)$ of subsets of the real line is increasing.
The sequence $(-n, a)$ is also increasing. Intuitively, the first example
increases to a limit of $[1, 2)$ in the sense that every point in the set
$[1, 2)$ is eventually included in one of the $F_n$. Similarly, the
sequence in the second example increases to $(-\infty, a)$. Formally, the
limit of an increasing sequence of sets can be defined as the union of all
of the sets in the sequence since the union contains all of the points in
all of the sets in the sequence and does not contain any points not
contained in at least one set (and hence an infinite number of sets) in
the sequence:

$$\lim_{n \to \infty} F_n = \bigcup_{n=1}^{\infty} F_n.$$

Figure 2.2(a) illustrates such a sequence in a Venn diagram.

Figure 2.2: (a) Increasing sets, (b) decreasing sets

Thus the limit of the sequence of sets $[1, 2 - 1/n)$ is indeed the set
$[1, 2)$, as desired, and the limit of $(-n, a)$ is $(-\infty, a)$. If $F$
is the limit of a sequence of increasing sets $F_n$, then we write
$F_n \uparrow F$.

Similarly, suppose that $F_n$; $n = 1, 2, \ldots$ is a decreasing sequence
of nested sets in the sense that $F_n \subset F_{n-1}$ for all $n$, as
illustrated by the Venn diagram in Figure 2.2(b). For example, the
sequences of sets $[1, 1 + 1/n)$ and $(1 - 1/n, 1 + 1/n)$ are decreasing.
Again we have a natural notion of the limit of this sequence: both these
sequences of sets collapse to the point or singleton set $\{1\}$, the
point in common to all the sets. This suggests a formal definition based
on the countably infinite intersection of the sets.

Given a decreasing sequence of sets $F_n$; $n = 1, 2, \ldots$, we define
the limit of the sequence by

$$\lim_{n \to \infty} F_n = \bigcap_{n=1}^{\infty} F_n,$$

that is, a point is in the limit of a decreasing sequence of sets if and
only if it is contained in all the sets of the sequence. If $F$ is the
limit of a sequence of decreasing sets $F_n$, then we write
$F_n \downarrow F$.

Thus, given a sequence of increasing or decreasing sets, the limit of the 
sequence can be defined in a natural way: the union of the sets of the 
sequence or the intersection of the sets of the sequence, respectively. 
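
The increasing example $F_n = [1, 2 - 1/n)$ can be explored numerically.
The sketch below is only an illustration under stated assumptions: the
function names are invented for this example, and a finite search can
suggest but not prove that a point lies outside every $F_n$. It finds the
first index at which a point enters the sequence, showing that points of
$[1, 2)$ are eventually captured while the endpoint 2 is not found.

```python
from fractions import Fraction


def in_F(x, n):
    """Membership of x in F_n = [1, 2 - 1/n)."""
    return 1 <= x < 2 - Fraction(1, n)


def first_index_containing(x, max_n=10**6):
    """Smallest n <= max_n with x in F_n, or None if no such n is found."""
    for n in range(1, max_n + 1):
        if in_F(x, n):
            return n
    return None


print(first_index_containing(Fraction(3, 2)))      # 3
print(first_index_containing(Fraction(199, 100)))  # 101
print(first_index_containing(2, max_n=10**4))      # None
```

Because the sets are nested, a point that enters at index $n$ belongs to
every later set as well, which matches the remark that a point of the
limit lies in an infinite number of sets of the sequence.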

Say that we have a sigma-field $\mathcal{F}$ and an increasing sequence
$F_n$, $n = 1, 2, \ldots$ of sets in the sigma-field. Since the limit of
the sequence is defined as a union and since the union of a countable
number of events must be an event, then the limit must be an event. For
example, if we are told that the sets $[1, 2 - 1/n)$ are all events, then
the limit $[1, 2)$ must also be an event. If we are told that all finite
intervals of the form $(a, b)$, where $a$ and $b$ are finite, are events,
then the semi-infinite interval $(-\infty, b)$ must also be an event,
since it is the limit of the sequence of sets $(-n, b)$ as
$n \to \infty$.

By a similar argument, if we are told that each set in a decreasing
sequence $F_n$ is an event, then the limit must be an event, since it is
an intersection of a countable number of events. Thus, for example, if we
are told that all finite intervals of the form $(a, b)$ are events, then
the points or singleton sets must also be events, since a point $\{a\}$ is
the limit of the decreasing sequence of sets $(a - 1/n, a + 1/n)$.

If a class of sets is only a field rather than a sigma-field, that is, if it 
satisfies only (2.17) and (2.18), then there is no guarantee that the class 
will contain all limits of sets. Hence, for example, knowing that a class of 
sets contains all half-open intervals of the form $(a, b]$ for $a$ and $b$ finite does
not ensure that it will also contain points or singleton sets! In fact, it is 
straightforward to show that the collection of all such half-open intervals 
together with the complements of such sets and all finite unions of the 
intervals and complements forms a field. The singleton sets, however, are 
not in the field! (See problem 2.5.) 

Thus if we tried to construct a probability theory based on only a field, 
we might have probabilities defined for events such as (a,b) meaning “the 
output voltage of a measurement is between a and b” and yet not have 
probabilities defined for a singleton set {a} meaning “the output voltage is 
exactly a.” By requiring that the event space be a sigma-field instead of 
only a field, we are assured that all such limits are indeed events. 

It is a straightforward exercise to show that given (2.17) and (2.18),
property (2.19) is equivalent to either of the following:

If $F_n \in \mathcal{F}$; $n = 1, 2, \ldots$, is a decreasing sequence or
an increasing sequence, then

$$\lim_{n \to \infty} F_n \in \mathcal{F}. \tag{2.26}$$

We have already seen that (2.19) implies (2.26). Conversely, if (2.26) is
true and $G_n$ is an arbitrary sequence of events, then define the
increasing sequence

$$F_n = \bigcup_{i=1}^{n} G_i.$$

Obviously $F_{n-1} \subset F_n$, and then (2.26) implies (2.19), since

$$\bigcup_{i=1}^{\infty} G_i = \bigcup_{n=1}^{\infty} F_n = \lim_{n \to \infty} F_n.$$

Examples 

As we have noted, for a given sample space the selection of an event space is 
not unique; it depends on the events to which it is desired to assign probabil- 
ities and also on analytical limitations on the ability to assign probabilities. 
We begin with two examples that represent the extremes of event spaces 
— one possessing the minimum quantity of sets and the other possessing 
the maximum. We then study event spaces useful for the sample space 
examples of the preceding section. 

[2.8] Given a sample space $\Omega$, then the collection
$\{\Omega, \emptyset\}$ is a sigma-field. This is just the trivial event
space already treated in example [2.0]. Observe again that this is the
smallest possible event space for any given sample space because no other
event space can have fewer elements.

[2.9] Given a sample space $\Omega$, then the collection of all subsets of
$\Omega$ is a sigma-field. This is true since any countable sequence of
set-theoretic operations on subsets of $\Omega$ must yield another subset
of $\Omega$ and hence must be in the collection of all possible subsets.
The collection of all subsets of a space is called the power set of the
space. Observe that this is the largest possible event space for the given
sample space, because it contains every possible subset of the sample
space.

This sigma-field is a useful event space for the sample spaces of examples
[2.2] and [2.3], that is, for sample spaces that are discrete. We shall
always take our event space as the power set when dealing with a discrete
sample space (except possibly for a few perverse homework problems). A
discrete sample space with $n$ elements has a power set with $2^n$
elements (problem 2.4). For example, the power set of the binary sample
space $\Omega = \{0, 1\}$ is the collection
$\{\{0\}, \{1\}, \Omega = \{0, 1\}, \emptyset\}$, a list of all possible
subsets of the space.
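
For a small finite sample space the power set can simply be listed. The
sketch below is illustrative; the helper name `power_set` is an assumption
made for this example. It enumerates the subsets by size and confirms the
$2^n$ count for two small cases.

```python
from itertools import chain, combinations


def power_set(omega):
    """Return every subset of the finite set omega as a frozenset."""
    elems = sorted(omega)
    subsets = chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1)
    )
    return [frozenset(s) for s in subsets]


print(len(power_set({0, 1})))    # 4
print(len(power_set(range(5))))  # 32
```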

Unfortunately, the power set is too large to be useful for continuous
spaces. To treat the reasons for this is beyond the scope of a book at
this level, but we can say that it is not possible in general to construct
interesting probability measures on the power set of a continuous space.
There are special cases where we can construct particular probability
measures on the power set of a continuous space by mimicking the
construction for a discrete space (see, e.g., problems 2.4, 2.6, and 2.9).
Truly continuous experiments cannot, however, be rigorously defined for
such a large event space because integrals cannot be defined over all
events in such spaces.

While both of the preceding examples can be used to provide event spaces
for the special case of $\Omega = \Re$, the real line, neither leads to a
useful probability theory in that case. In the next example we consider
another event space for the real line that is more useful and, in fact, is
used almost always for $\Re$ and higher dimensional Euclidean spaces.
First, however, we need to treat the idea of generating an event space
from a collection of important events. Intuitively, given a collection of
important sets $\mathcal{G}$ that we require to be events, the event space
$\sigma(\mathcal{G})$ generated by $\mathcal{G}$ is the smallest event
space to which all the sets in $\mathcal{G}$ belong. That is,
$\sigma(\mathcal{G})$ is an event space, it contains all the sets in
$\mathcal{G}$, and no smaller collection of sets satisfies these two
conditions.

Regardless of the details, it is worth emphasizing the key points of this 
discussion. 

• The notion of a generated sigma-field allows one to describe an event 
space for the real line, the Borel field, that contains all physically im- 
portant events and which will lead to a useful calculus of probability. 
It is usually not important to understand the detailed structure of 
this event space past the facts that it 

— is indeed an event space, and 

— it contains all the important events such as intervals of all types 
and points. 

• The notion of a generated sigma-field can be used to extend the event 
space of the real line to event spaces of vectors, sequences, and wave- 
forms taking on real values. Again the detailed structure is usually 
not important past the fact that it 

— is indeed an event space, and 

— it contains all the important events such as those described by 
requiring any finite collection of coordinate values to lie within 
intervals. 

★Generating Event Spaces

Any useful event space for the real line should include as members all
intervals of the form $(a, b)$ since we certainly wish to consider events
of the form “the output voltage is between 3 and 5 volts.” Furthermore, we
obviously require that the event space satisfy the defining properties for
an event space, that is, that we have a collection of subsets of $\Omega$
that satisfy properties (2.17) through (2.19). A means of accomplishing
both of these goals in a relatively simple fashion is to define our event
space as the smallest sigma-field that contains the desired subsets, to
wit, the intervals and all of their countable set-theoretic combinations
(bewildering as it may seem, this is not the same as all subsets of
$\Re$). Of course, although a sigma-field that is based on the intervals
is most useful, it is also possible to consider other starting points.
These considerations motivate the following general definition.

Given a sample space $\Omega$ (such as the real line $\Re$) and an
arbitrary class $\mathcal{G}$ of subsets of $\Omega$ (usually the class of
all open intervals of the form $(a, b)$ when $\Omega = \Re$), define
$\sigma(\mathcal{G})$, the sigma-field generated by the class
$\mathcal{G}$, to be the smallest sigma-field containing all of the sets
in $\mathcal{G}$, where by “smallest” we mean that if $\mathcal{F}$ is any
sigma-field and it contains $\mathcal{G}$, then it contains
$\sigma(\mathcal{G})$. (See any book on measure theory, e.g., Ash [1].)
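
On a finite sample space the generated sigma-field can be computed by
brute force, which may help build intuition for the definition. The sketch
below is an illustration under that finiteness assumption only; it says
nothing about how the construction behaves on the real line, and the
function name `generated_sigma_field` and the sample space
$\{1, 2, 3, 4\}$ are hypothetical. It repeatedly adds complements and
pairwise unions until nothing new appears; on a finite space this closure
is a field and therefore automatically a sigma-field.

```python
def generated_sigma_field(omega, G):
    """Smallest collection containing G that is closed under complement and union.

    omega : finite iterable, the sample space
    G     : iterable of subsets of omega that are required to be events
    """
    omega = frozenset(omega)
    events = {frozenset(), omega} | {frozenset(g) for g in G}
    changed = True
    while changed:
        changed = False
        current = list(events)
        for F in current:
            complement = omega - F
            if complement not in events:
                events.add(complement)
                changed = True
            for H in current:
                union = F | H
                if union not in events:
                    events.add(union)
                    changed = True
    return events


sigma = generated_sigma_field({1, 2, 3, 4}, [{1}, {1, 2}])
print(len(sigma))               # 8
print(frozenset({2}) in sigma)  # True
```

Here the generating sets force the three indivisible pieces $\{1\}$,
$\{2\}$, and $\{3, 4\}$, so the generated sigma-field has $2^3 = 8$
members, which is strictly smaller than the 16-element power set.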

For example, as noted before, we might require that a sigma-field of the 
real line contain all intervals; then it would also have to contain at least 
all complements of intervals and all countable unions and intersections of 
intervals and all countable complements, unions, and intersections of these 
results, ad infinitum. This technique will be used several times to specify 
useful event spaces in complicated situations such as continuous simple 
spaces, sequence spaces, and function spaces. We are now ready to provide 
the proper, most useful event space for the real line. 

[2.10] Given the real line $\Re$, the Borel field (or, more accurately,
the Borel sigma-field) is defined as the sigma-field generated by all the
open intervals of the form $(a, b)$. The members of the Borel field are
called Borel sets. We shall denote the Borel field by $\mathcal{B}(\Re)$,
and hence

$$\mathcal{B}(\Re) = \sigma(\text{all open intervals}).$$

Since $\mathcal{B}(\Re)$ is a sigma-field and since it contains all of the
open intervals, it must also contain limit sets of the form

$$(-\infty, b) = \lim_{n \to \infty} (-n, b),$$

$$(a, \infty) = \lim_{n \to \infty} (a, n),$$

and

$$\{a\} = \lim_{n \to \infty} (a - 1/n, a + 1/n),$$

that is, the Borel field must include semi-infinite open intervals and the
singleton sets or individual points. Furthermore, since the Borel field is
a sigma-field it must contain differences. Hence it must contain
semi-infinite half-open sets of the form

$$(-\infty, b] = (-\infty, \infty) - (b, \infty),$$

and since it must contain unions of its members, it must contain half-open
intervals of the form

$$(a, b] = (a, b) \cup \{b\} \quad \text{and} \quad [a, b) = (a, b) \cup \{a\}.$$

In addition, it must contain all closed intervals and all finite or countable 
unions and complements of intervals of any of the preceding forms. Roughly 
speaking, the Borel field contains all subsets of the real line that can be 
obtained as an approximation of countable combinations of intervals. It is 
a deep and difficult result of measure theory that the Borel field of the real 
line is in fact different from the power set of the real line; that is, there 
exist subsets of the real line that are not in the Borel field. While we will 
not describe such a subset, we can guarantee that these “unmeasurable” 
sets have no physical importance, that they are very hard to construct, and 
that an engineer will never encounter such a subset in practice. It may, 
however, be necessary to demonstrate that some weird subset is in fact an 
event in this sigma-field. This is typically accomplished by showing that it 
is the limit of simple Borel sets. 

In some cases we wish to deal not with a sample space that is the entire 
real line, but one that is some subset of the real line. In this case we define 
the Borel field as the Borel field of the real line “cut down” to the smaller 
space. 

Given that the sample space, $\Omega$, is a Borel subset of the real line $\Re$, the
Borel field of $\Omega$, denoted $\mathcal{B}(\Omega)$, is defined as the collection of all sets
of the form $F \cap \Omega$, for $F \in \mathcal{B}(\Re)$; that is, the intersection of $\Omega$ with
all of the Borel sets of $\Re$ forms the class of Borel sets of $\Omega$.

It can be shown (problem 2.3) that, given a discrete subset A of the 
real line, the Borel field B(A) is identical to the power set of A. Thus, for 
the first three examples of sample spaces, the Borel field serves as a useful 
event space since it reduces to the intuitively appealing class of all subsets 
of the sample space. 

The remaining examples of sample spaces are all product spaces. The 
construction of event spaces for such product spaces (that is, spaces of
vectors, sequences, or waveforms) is more complicated and less intuitive
than the constructions for the preceding event spaces. In fact, there are 
several possible techniques of construction, which in some cases lead to 
different event spaces. We wish to convey an understanding of the structure 
of such event spaces, but we do not wish to dwell on the technical difficulties 
that can be encountered. Hence we shall study only one of the possible 
constructions — the simplest possible definition of a product sigma-field — 
by making a direct analogy to a product sample space. This definition will 
suffice for most systems studied herein, but it has shortcomings. At this 
time we mention one particular weakness: The event space that we shall 
define may not be big enough when studying the theory of continuous time 
random processes. 

[2.11] Given an abstract space $A$, a sigma-field $\mathcal{F}$ of subsets of $A$, an index
set $\mathcal{I}$, and a product sample space of the form

$$A^{\mathcal{I}} = \prod_{t \in \mathcal{I}} A_t,$$

where the $A_t$ are all replicas of $A$, the product sigma-field

$$\mathcal{F}^{\mathcal{I}} = \prod_{t \in \mathcal{I}} \mathcal{F}_t$$

is defined as the sigma-field generated by all “one-dimensional” sets
of the form

$$\{\{a_t;\ t \in \mathcal{I}\} : a_t \in F \text{ for } t = s \text{ and } a_t \in A_t \text{ for } t \ne s\}$$

for some $s \in \mathcal{I}$ and some $F \in \mathcal{F}$; that is, the product sigma-field
is the sigma-field generated by all “one-dimensional” events formed
by collecting all of the vectors or sequences or waveforms with one
coordinate constrained to lie in a one-dimensional event and with the
other coordinates unrestricted. The product sigma-field must contain
all such events; that is, for all possible indices $s$ and all possible events
$F$.

Thus, for example, given the one-dimensional abstract space $\Re$, the real
line along with its Borel field, Figure 2.3(a)-(c) depicts three examples of
one-dimensional sets in $\Re^2$, the two-dimensional Euclidean plane. Note, for
example, that the unit circle $\{(x, y) : x^2 + y^2 \le 1\}$ is not a one-dimensional
set since it requires simultaneous constraints on two coordinates.

More generally, for a fixed finite $k$ the product sigma-field $\mathcal{B}(\Re)^{\mathcal{Z}_k}$ (or
simply $\mathcal{B}(\Re)^k$) of $k$-dimensional Euclidean space $\Re^k$ is the smallest
sigma-field containing all one-dimensional events of the form
$\{\mathbf{x} = (x_0, x_1, \ldots, x_{k-1}) : x_i \in F\}$ for some $i = 0, 1, \ldots, k - 1$ and some
Borel set $F$ of $\Re$. The two-dimensional example Figure 2.3(a) has this form
with $k = 2$, $i = 0$, and $F = (1, 3)$. This one-dimensional set consists of all
values in the infinite rectangle between 1 and 3 in the $x_0$ direction and
between $-\infty$ and $\infty$ in the $x_1$ direction.

[Figure 2.3: One- and two-dimensional events in two-dimensional space.]

To summarize, we have defined a space $A$ with event space $\mathcal{F}$, and an
index set $\mathcal{I}$ such as $\mathcal{Z}_+$, $\mathcal{Z}$, $\Re$, or $[0, 1)$, and we have formed the product
space $A^{\mathcal{I}}$ and the associated product event space $\mathcal{F}^{\mathcal{I}}$. We know that this
event space contains all one-dimensional events by construction. We next
consider what other events must be in $\mathcal{F}^{\mathcal{I}}$ by virtue of its being an event
space.

After the one-dimensional events that pin down the value of a single
coordinate of the vector or sequence or waveform, the next most general
kinds of events are finite-dimensional sets that separately pin down the
values of a finite number of coordinates. Let $\mathcal{K}$ be a finite collection of
members of $\mathcal{I}$ and hence $\mathcal{K} \subset \mathcal{I}$. Say that $\mathcal{K}$ has $K$ members, which
we shall denote as $\{k_i;\ i = 0, 1, \ldots, K - 1\}$. These $K$ numbers can be
thought of as a collection of sample times such as {1, 4, 8, 156, 1027} for
a sequence or {1.5, 9.07, 40.0, 41.2, 41.3} for a waveform. We assume for
convenience that the sample times are ordered in increasing fashion. Let
$\{F_{k_i};\ i = 0, 1, \ldots, K - 1\}$ be a collection of members of $\mathcal{F}$. Then a set of
the form

$$\{\{x_t;\ t \in \mathcal{I}\} : x_{k_i} \in F_{k_i};\ i = 0, 1, \ldots, K - 1\}$$

is an example of a finite-dimensional set. Note that it collects all sequences 
or waveforms such that a finite number of coordinates are constrained to 
lie in one-dimensional events. An example of two-dimensional sets of this 
form in two-dimensional space is illustrated in Figure 2.3(d). Observe there 
that when the one-dimensional sets constraining the coordinates are inter- 
vals, then the two-dimensional sets are rectangles. Analogous to the two- 
dimensional example, finite-dimensional events having separate constraints 
on each coordinate are called rectangles. Observe, for example, that a circle
or sphere in Euclidean space is not a rectangle because it cannot be defined
using separate constraints on the coordinates; the constraints on each
coordinate depend on the values of the others. For example, in two
dimensions we require that $x_0^2 \le 1 - x_1^2$.

Note that Figure 2.3(d) is just the intersection of examples (a) and (b) of 
Figure 2.3. In fact, in general we can express finite-dimensional rectangles 
as intersections of one-dimensional events as follows: 



$$\{\{x_t;\ t \in \mathcal{I}\} : x_{k_i} \in F_{k_i};\ i = 0, 1, \ldots, K - 1\} = \bigcap_{i=0}^{K-1} \{\{x_t;\ t \in \mathcal{I}\} : x_{k_i} \in F_{k_i}\},$$

that is, a set constraining a finite number of coordinates to each lie in
one-dimensional events or sets in $\mathcal{F}$ is the intersection of a collection of
one-dimensional events. Since $\mathcal{F}^{\mathcal{I}}$ is a sigma-field and since it contains the
one-dimensional events, it must contain such finite intersections, and hence
it must contain such finite-dimensional events.
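
The relation between rectangles and one-dimensional events can be checked by brute force on a small finite product space. The following Python sketch is only an illustration under stated assumptions: the alphabet `A`, the index set `I`, the constrained coordinates, and the helper `one_dim_event` are hypothetical choices made for this example and do not appear in the text.

```python
from itertools import product

# Hypothetical toy setup: alphabet A = {0, 1, 2} and index set I = {0, 1, 2, 3}.
A = (0, 1, 2)
I = range(4)
space = set(product(A, repeat=len(I)))  # the product sample space A^I

def one_dim_event(s, F):
    """All vectors whose coordinate s lies in F; other coordinates are free."""
    return {x for x in space if x[s] in F}

# Constrain coordinates 1 and 3 to lie in the sets {0, 2} and {1}.
constraints = {1: {0, 2}, 3: {1}}

# Rectangle defined directly by separate constraints on each coordinate.
rectangle = {x for x in space if all(x[k] in F for k, F in constraints.items())}

# The same set obtained as an intersection of one-dimensional events.
intersection = set(space)
for k, F in constraints.items():
    intersection &= one_dim_event(k, F)

print(rectangle == intersection, len(rectangle))
```

Because the space is finite here, every subset is an event; the point of the sketch is only the set identity, not the measure-theoretic subtleties that arise for sequences and waveforms.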

By concentrating on events that can be represented as the finite inter- 
section of one-dimensional events we do not mean to imply that all events 
in the product event space can be represented in this fashion — the event 
space will also contain all possible limits of finite unions of such rectangles, 
complements of such sets, and so on. For example, the unit circle in two 
dimensions is not a rectangle, but it can be considered as a limit of unions 
of rectangles and hence is in the event space generated by the rectangles.
(See problem 2.31.) 

The moral of this discussion is that the product sigma-field for spaces
of sequences and waveforms must contain (but not consist exclusively of)
all sets that are described by requiring that the outputs of coordinates for
a finite number of events lie in sets in the one-dimensional event space $\mathcal{F}$.

We shall further explore such product event spaces when considering 
random processes, but the key points remain 

1. a product event space is a sigma-field, and

2. it contains all “one-dimensional events” consisting of subsets of the 
product sample space formed by grouping together all vectors or se- 
quences or waveforms having a single fixed coordinate lying in a one- 
dimensional event. In addition, it contains all rectangles or finite- 
dimensional events consisting of all vectors or sequences or wave- 
forms having a finite number of coordinates constrained to lie in one- 
dimensional events. 

2.3.3 Probability Measures 

The defining axioms of a probability measure as given in equations (2.22) 
through (2.25) correspond generally to intuitive notions, at least for the 
first three properties. The first property requires that a probability be 
a nonnegative number. In a purely mathematical sense, this is an arbi- 
trary restriction, but it is in accord with the long history of intuitive and 
combinatorial developments of probability. Probability measures share this 
property with other measures such as area, volume, weight, and mass. 

The second defining property corresponds to the notion that the
probability that something will happen or that an experiment will produce one
of its possible outcomes is one. This, too, is mathematically arbitrary but
is a convenient and historical assumption. (From childhood we learn about
things that are “100% certain;” obviously we could as easily take 1 or $\pi$
(but not infinity; why?) to represent certainty.)

The third property, “additivity” or “finite additivity,” is the key one. 
In English it reads that the probability of occurrence of a finite collection 
of events having no points in common must be the sum of the probabilities 
of the separate events. More generally, the basic assumption of measure 
theory is that any measure — probabilistic or not — such as weight, volume, 
mass, and area should be additive: the mass of a group of disjoint regions 
of matter should be the sum of the separate masses; the weight of a group 
of objects should be the sum of the individual weights. Equation (2.24) 
only pins down this property for finite collections of events. The additional 
restriction of (2.25), called countable additivity, is a limiting or asymptotic
or infinite version, analogous to (2.19) for set algebra. This again leads
to the rhetorical question of why the more complicated, more restrictive,
and less intuitive infinite version is required. In fact, it was the addition of 
this limiting property that provided the fundamental idea for Kolmogorov’s 
development of modern probability theory in the 1930s. 

The response to the rhetorical question is essentially the same as that 
for the asymptotic set algebra property: Countably infinite properties are 
required to handle asymptotic and limiting results. Such results are crucial 
because we often need to evaluate the probabilities of complicated events 
that can only be represented as a limit of simple events. (This is analogous 
to the way that integrals are obtained as limits of finite sums.) 

Note that it is countable additivity that is required. Uncountable ad- 
ditivity cannot be defined sensibly. This is easily seen in terms of the fair 
wheel mentioned at the beginning of the chapter. If the wheel is spun, any 
particular number has probability zero. On the other hand, the probability 
of the event made up of all of the uncountable numbers between 0 and 1 is 
obviously one. If you consider defining the probability of all the numbers 
between 0 and 1 to be the uncountable sum of the individual probabilities, 
you see immediately the essential contradiction that results. 

Since countable additivity has been added to the axioms proposed in
the introduction, the formula (2.11) used to compute probabilities of events
broken up by a partition immediately extends to partitions with a countable
number of elements; that is, if $F_k$; $k = 1, 2, \ldots$ forms a partition of $\Omega$ into
disjoint events ($F_n \cap F_k = \emptyset$ if $n \ne k$ and $\bigcup_k F_k = \Omega$), then for any event $G$

$$P(G) = \sum_{k=1}^{\infty} P(G \cap F_k).$$    (2.27)



Limits of Probabilities 

At times we are interested in finding the probability of the limit of a se- 
quence of events. To relate the countable additivity property of (2.25) 
to limiting properties, recall the discussion of the limiting properties of 
events given earlier in this chapter in terms of increasing and decreasing
sequences of events. Say we have an increasing sequence of events
$F_n$; $n = 0, 1, 2, \ldots$, $F_{n-1} \subset F_n$, and let $F$ denote the limit set, that is,
the union of all of the $F_n$. We have already argued that the limit set $F$ is
itself an event. Intuitively, since the $F_n$ converge to $F$, the probabilities of
the $F_n$ should converge to the probability of $F$. Such convergence is called
a continuity property of probability and is very useful for evaluating the
probabilities of complicated events as the limit of a sequence of
probabilities of simpler events. We shall show that countable additivity implies such
continuity. To accomplish this, define the sequence of sets $G_0 = F_0$ and
$G_n = F_n - F_{n-1}$ for $n = 1, 2, \ldots$. The $G_n$ are disjoint and have the same
union as do the $F_n$ (see Figure 2.2(a) as a visual aid). Thus we have from
countable additivity that



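$$P\left(\lim_{n \to \infty} F_n\right) = P\left(\bigcup_{k=0}^{\infty} F_k\right) = P\left(\bigcup_{k=0}^{\infty} G_k\right) = \sum_{k=0}^{\infty} P(G_k) = \lim_{n \to \infty} \sum_{k=0}^{n} P(G_k),$$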



where the last step simply uses the definition of an infinite sum. Since
$G_n = F_n - F_{n-1}$ and $F_{n-1} \subset F_n$, $P(G_n) = P(F_n) - P(F_{n-1})$ and hence

$$\sum_{k=0}^{n} P(G_k) = P(F_0) + \sum_{k=1}^{n} \left( P(F_k) - P(F_{k-1}) \right) = P(F_n),$$

an example of what is called a “telescoping sum” where each term cancels
the previous term and adds a new piece, i.e.,

$$\begin{aligned}
P(F_n) = {} & P(F_n) - P(F_{n-1}) \\
& + P(F_{n-1}) - P(F_{n-2}) \\
& + P(F_{n-2}) - P(F_{n-3}) \\
& \ \vdots \\
& + P(F_1) - P(F_0) \\
& + P(F_0).
\end{aligned}$$

Combining these results completes the proof of the following statement. 
If Fn is a sequence of increasing events, then 

$$P\left(\lim_{n \to \infty} F_n\right) = \lim_{n \to \infty} P(F_n),$$    (2.28)

that is, the probability of the limit of a sequence of increasing 
events is the limit of the probabilities. 
Note that the sequence of probabilities on the right-hand side of (2.28) is
increasing with increasing $n$. Thus, for example, probabilities of semi-infinite
intervals can be found as a limit as $P((-\infty, a]) = \lim_{n \to \infty} P((-n, a])$. A
similar argument can be used to show that one can also interchange the
limit with the probability measure given a sequence of decreasing events;
that is,



If $F_n$ is a sequence of decreasing events, then

$$P\left(\lim_{n \to \infty} F_n\right) = \lim_{n \to \infty} P(F_n),$$    (2.29)

that is, the probability of the limit of a sequence of decreasing 
events is the limit of the probabilities. 

Note that the sequence of probabilities on the right-hand side of (2.29) 
is decreasing with increasing n. Thus, for example, the probabilities of 
points can be found as a limit of probabilities of intervals, $P(\{a\}) =
\lim_{n \to \infty} P((a - 1/n, a + 1/n))$.

It can be shown (see problem 2.20) that, given (2.22) through (2.24), 
the three conditions (2.25), (2.28), and (2.29) are equivalent; that is, any 
of the three could serve as the fourth axiom of probability. 

Property (2.28) is called continuity from below, and (2.29) is called conti- 
nuity from above. The designations “from below” and “from above” relate 
to the direction from which the respective sequences of probabilities ap- 
proach their limit. These continuity results are the basis for using integral 
calculus to compute probabilities, since integrals can be expressed as limits 
of sums. 
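
The two continuity properties can be illustrated numerically with the fair wheel mentioned at the beginning of the chapter. The sketch below is a hedged illustration, not part of the original development: it assumes the uniform model on [0, 1), in which the probability of an interval is its length, and the function name `wheel_prob` and the choice $a = 1/2$ are hypothetical.

```python
from fractions import Fraction

def wheel_prob(lo, hi):
    """P of the interval (lo, hi) under the assumed uniform model on [0, 1)."""
    lo, hi = max(lo, Fraction(0)), min(hi, Fraction(1))
    return max(hi - lo, Fraction(0))

a = Fraction(1, 2)
ns = (1, 10, 100, 1000)

# Increasing events F_n = [0, 1 - 1/n): probabilities increase toward 1.
below = [wheel_prob(Fraction(0), 1 - Fraction(1, n)) for n in ns]

# Decreasing events F_n = (a - 1/n, a + 1/n): probabilities decrease toward
# P({a}) = 0.
above = [wheel_prob(a - Fraction(1, n), a + Fraction(1, n)) for n in ns]

print([float(p) for p in below])
print([float(p) for p in above])
```

The first list approaches its limit from below and the second from above, which is the origin of the two names.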



2.4 Discrete Probability Spaces 

We now provide several examples of probability measures on our examples 
of sample spaces and sigma-fields and thereby give some complete examples 
of probability spaces. 

The first example formalizes the description of a probability measure
as a sum of a pmf as introduced in the introductory section.

[2.12] Let $\Omega$ be a finite set and let $\mathcal{F}$ be the power set of $\Omega$. Suppose that
we have a function $p(\omega)$ that assigns a real number to each sample
point $\omega$ in such a way that

$$p(\omega) \ge 0, \quad \text{all } \omega \in \Omega$$    (2.30)

and

$$\sum_{\omega \in \Omega} p(\omega) = 1.$$    (2.31)

Define the set function $P$ by

$$P(F) = \sum_{\omega \in F} p(\omega) = \sum_{\omega \in \Omega} 1_F(\omega) p(\omega), \quad \text{all } F \in \mathcal{F},$$    (2.32)

where $1_F(\omega)$ is the indicator function of the set $F$, equal to 1 if $\omega \in F$ and 0
otherwise.

For simplicity we drop the $\omega \in \Omega$ underneath the sum; that is, when
no range of summation is explicit, it should be assumed the sum is over all
possible values. Thus we can abbreviate (2.32) to

$$P(F) = \sum 1_F(\omega) p(\omega), \quad \text{all } F \in \mathcal{F}.$$    (2.33)

$P$ is easily verified to be a probability measure: It obviously satisfies
axioms 2.1 and 2.2. It is finitely and countably additive from the properties
of sums. In particular, given a sequence of disjoint events, only a finite
number can be distinct (since the power set of a finite space has only a
finite number of members). To be disjoint, the balance of the sequence
must equal $\emptyset$. The probability of the union of these sets will be the finite
sum of the $p(\omega)$ over the points in the union, which equals the sum of the
probabilities of the sets in the sequence. Example [2.1] is a special case of
example [2.12], as is the coin flip example of the introductory section.
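
The construction of a probability measure from a pmf on a finite space is easy to check exhaustively by machine. The following Python sketch assumes a hypothetical pmf on the four-point space {0, 1, 2, 3}; the names `pmf`, `prob`, and `power_set` are illustrative and are not notation from the text.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical pmf on the finite sample space {0, 1, 2, 3}.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

def prob(event):
    """P(F): sum of the pmf over the points of the event F."""
    return sum((pmf[w] for w in event), Fraction(0))

omega = frozenset(pmf)

# The event space is the power set of the sample space.
power_set = [frozenset(c) for c in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

nonneg = all(prob(F) >= 0 for F in power_set)
normalized = prob(omega) == 1
additive = all(prob(F | G) == prob(F) + prob(G)
               for F in power_set for G in power_set if not (F & G))

print(nonneg, normalized, additive)
```

Exact rational arithmetic is used so that the additivity check is an equality rather than a floating-point approximation.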

The summation (2.33) used to define probability measures for a discrete
space is a special case of a more general weighted sum, which we pause
to define and consider. Suppose that $g$ is a real-valued function defined
on $\Omega$, i.e., $g : \Omega \to \Re$ assigns a real number $g(\omega)$ to every $\omega \in \Omega$. We
could consider more general complex-valued functions, but for the moment
it is simpler to stick to real-valued functions. Also, we could consider
subsets of $\Re$, but we leave it more general at this time. Recall that in
the introductory section we considered such a function to be an example
of signal processing and called it a random variable. Given a pmf $p$, define
the expectation¹ of $g$ (with respect to $p$) as

$$E(g) = \sum g(\omega) p(\omega).$$    (2.34)

¹This is not in fact the fundamental definition of expectation that will be
introduced in chapter 4, but it will be seen to be equivalent.

With this definition (2.33) with $g(\omega) = 1_F(\omega)$ yields

$$P(F) = E(1_F),$$    (2.35)

showing that the probability of an event is the expectation of the indicator 
function of the event. Mathematically, we can think of expectation as a 
generalization of the idea of probability since probability is the special case 
of expectation that results when the only functions allowed are indicator 
functions. 

Expectations are also called probabilistic averages or statistical aver- 
ages. For the time being, probabilities are the most important examples 
of expectation. We shall see many examples, however, so it is worthwhile 
to mention a few of the most important. Suppose that the sample space
is a subset of the real line, e.g., $\mathcal{Z}$ or $\mathcal{Z}_n$. One of the most commonly
encountered expectations is the mean or first moment

$$m = \sum \omega\, p(\omega),$$    (2.36)

where $g(\omega) = \omega$, the identity function. A more general idea is the $k$th
moment defined by

$$m^{(k)} = \sum \omega^k p(\omega),$$    (2.37)

so that $m = m^{(1)}$. After the mean, the most commonly encountered
moment in practice is the second moment,

$$m^{(2)} = \sum \omega^2 p(\omega).$$    (2.38)

Moments can be thought of as parameters describing a pmf, and some 
computations involving signal processing will turn out to depend only on 
certain moments. 

A slight variation on $k$th-order moments is the so-called centralized
moments, formed by subtracting the mean before taking the power:

$$\sum (\omega - m)^k p(\omega),$$    (2.39)

but the only such moment commonly encountered in practice is the variance

$$\sigma^2 = \sum (\omega - m)^2 p(\omega).$$    (2.40)

The variance and the second moment are easily related as

$$\begin{aligned}
\sigma^2 &= \sum (\omega^2 - 2\omega m + m^2) p(\omega) \\
&= \sum \omega^2 p(\omega) - 2m \sum \omega\, p(\omega) + m^2 \sum p(\omega) \\
&= m^{(2)} - 2m^2 + m^2 \\
&= m^{(2)} - m^2.
\end{aligned}$$    (2.41)
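
As a concrete check of these definitions, the following Python sketch computes the mean, second moment, and variance of a pmf directly from the weighted sums and confirms that the variance equals the second moment minus the squared mean. The fair-die pmf and the function name `expectation` are assumptions made for illustration only.

```python
from fractions import Fraction

def expectation(g, pmf):
    """E(g) = sum over w of g(w) p(w), for a pmf given as a dict."""
    return sum((g(w) * p for w, p in pmf.items()), Fraction(0))

# Hypothetical example: a fair six-sided die with faces 1..6.
die = {w: Fraction(1, 6) for w in range(1, 7)}

mean = expectation(lambda w: w, die)
second_moment = expectation(lambda w: w ** 2, die)
variance = expectation(lambda w: (w - mean) ** 2, die)

# Probability of the event {w even} as the expectation of its indicator.
p_even = expectation(lambda w: 1 if w % 2 == 0 else 0, die)

print(mean, second_moment, variance, p_even)
```

The last line of the computation also illustrates that a probability is the expectation of an indicator function.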

Probability Mass Functions 

A function $p(\omega)$ satisfying (2.30) and (2.31) is called a probability mass
function or pmf. It is important to observe that the probability mass function
is defined only for points in the sample space, while a probability measure 
is defined for events, sets which belong to an event space. Intuitively, the 
probability of a set is given by the sum of the probabilities of the points 
as given by the pmf. Obviously it is much easier to describe the probability
mass function than the probability measure since it need only be specified
for points. The axioms of probability then guarantee that the probability
mass function can be used to compute the probability measure. Note that given
one, we can always determine the other. In particular, given the pmf $p$, we
can construct $P$ using (2.32). Given $P$, we can find the corresponding pmf
$p$ from the formula

$$p(\omega) = P(\{\omega\}).$$

We list below several of the most common examples of pmf’s. The 
reader should verify that they are all indeed valid pmf’s, that is, that they 
satisfy (2.30) and (2.31). 

The binary pmf. $\Omega = \{0, 1\}$; $p(0) = 1 - p$, $p(1) = p$, where $p$ is a
parameter in $(0, 1)$.

A uniform pmf. $\Omega = \mathcal{Z}_n = \{0, 1, \ldots, n - 1\}$ and $p(k) = 1/n$; $k \in \mathcal{Z}_n$.

The binomial pmf. $\Omega = \mathcal{Z}_{n+1} = \{0, 1, \ldots, n\}$ and

$$p(k) = \binom{n}{k} p^k (1 - p)^{n-k}; \quad k \in \mathcal{Z}_{n+1},$$

where

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

is the binomial coefficient.



The binary pmf is a probability model for coin flipping with a biased 
coin or for a single sample of a binary data stream. A uniform pmf on $\mathcal{Z}_6$
can model the roll of a fair die. Observe that it would not be a good model 
for ASCII data since, for example, the letters t and e and the symbol for 
space have a higher probability than other letters. The binomial pmf is a 
probability model for the number of heads in n successive independent flips 
of a biased coin, as will later be seen. 
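
The verification requested of the reader can also be carried out numerically for particular parameter values. The sketch below, offered only as an illustration, builds the three pmf's as Python dictionaries and checks nonnegativity and normalization; the function names and the parameter choices ($p = 1/3$, $n = 6$ and $n = 5$) are hypothetical.

```python
from fractions import Fraction
from math import comb

def binary_pmf(p):
    return {0: 1 - p, 1: p}

def uniform_pmf(n):
    return {k: Fraction(1, n) for k in range(n)}

def binomial_pmf(n, p):
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

def is_valid_pmf(pmf):
    """Check (2.30) and (2.31): nonnegative values that sum to one."""
    return all(v >= 0 for v in pmf.values()) and sum(pmf.values()) == 1

p = Fraction(1, 3)  # illustrative parameter value
examples = {
    "binary": binary_pmf(p),
    "uniform": uniform_pmf(6),
    "binomial": binomial_pmf(5, p),
}

for name, pmf in examples.items():
    print(name, is_valid_pmf(pmf))
```

A check for specific parameters is of course not a proof; the general argument for the binomial case uses the binomial theorem.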

The same construction provides a probability measure on countably 
infinite spaces such as $\mathcal{Z}$ and $\mathcal{Z}_+$. It is no longer as simple to prove countable
additivity, but it should be fairly obvious that it holds and, at any rate, it 
follows from standard results in elementary analysis for convergent series. 
Hence we shall only state the following example without proving countable 
additivity, but bear in mind that it follows from the properties of infinite 
summations. 

[2.13] Let $\Omega$ be a space with a countably infinite number of elements and
let $\mathcal{F}$ be the power set of $\Omega$. Then if $p(\omega)$; $\omega \in \Omega$ satisfies (2.30) and
(2.31), the set function $P$ defined by (2.32) is a probability measure.

Two common examples of pmf’s on countably infinite sample spaces 
follow. The reader should test their validity. 

The geometric pmf. $\Omega = \{1, 2, 3, \ldots\}$ and $p(k) = (1 - p)^{k-1} p$; $k =
1, 2, \ldots$, where $p \in (0, 1)$ is a parameter.

The Poisson pmf. $\Omega = \mathcal{Z}_+ = \{0, 1, 2, \ldots\}$ and $p(k) = (\lambda^k e^{-\lambda})/k!$,
where $\lambda$ is a parameter in $(0, \infty)$. (Keep in mind that $0! = 1$.)



We will later see the origins of several of these pmf’s and their appli- 
cations. For example, both the binomial and the geometric pmf will be 
derived from the simple binary pmf model for flipping a single coin. For 
the moment they should be considered as common important examples. 
Various properties of these pmf’s and a variety of calculations involving 
them are explored in the problems at the end of the chapter. 

Computational Examples 

The various named pmf’s provide examples for computing probabilities and 
other expectations. Although much of this is prerequisite material, it does 
not hurt to collect several of the more useful tricks that arise in evaluating
sums. The binary pmf is too simple to provide much interest on its own, so
first consider the uniform pmf on $\mathcal{Z}_n$. This is trivially a valid pmf since it is
nonnegative and sums to 1. The probability of any set is simply

$$P(F) = \frac{1}{n} \sum 1_F(\omega) = \frac{\#(F)}{n},$$

where $\#(F)$ denotes the number of elements or points in the set $F$. The
mean is given by

$$m = \frac{1}{n} \sum_{k=1}^{n-1} k = \frac{n - 1}{2},$$    (2.42)

which follows from the standard formula $\sum_{k=1}^{N} k = N(N+1)/2$, easily verified
by induction, as detailed in appendix B. The second moment is given by

$$m^{(2)} = \frac{1}{n} \sum_{k=1}^{n-1} k^2 = \frac{(n - 1)(2n - 1)}{6},$$    (2.43)

as can also be verified by induction. The variance can be found by
combining (2.43), (2.42), and (2.41).

The binomial pmf is more complicated. The first issue is to prove that it 
sums to one and hence is a valid pmf (it is obviously nonnegative) . This is 
accomplished by recalling the binomial theorem from high school algebra: 

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$    (2.44)

and setting $a = p$ and $b = 1 - p$ to write

$$\sum_{k=0}^{n} p(k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = (p + 1 - p)^n = 1.$$

Finding moments is trickier here, and we shall later develop a much 
easier way to do this using exponential transforms. Nonetheless, it provides 
some useful practice to compute an example sum, if only to demonstrate 
later how much work can be avoided! Finding the mean requires evaluation
of the sum

$$m = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k} = \sum_{k=1}^{n} \frac{n!}{(n - k)!\,(k - 1)!} p^k (1 - p)^{n-k}.$$

The trick here is to recognize that the sum looks very much like the terms
in the binomial theorem, but a change of variables is needed to get the
binomial theorem to simplify things. Changing variables by defining $l =
k - 1$, the sum becomes

$$m = \sum_{l=0}^{n-1} \frac{n!}{(n - l - 1)!\, l!} p^{l+1} (1 - p)^{n-l-1},$$

which will very much resemble the binomial theorem with $n - 1$ replacing
$n$ if we factor out a $p$ and an $n$:

$$m = np \sum_{l=0}^{n-1} \frac{(n - 1)!}{(n - 1 - l)!\, l!} p^l (1 - p)^{n-1-l} = np\,(p + 1 - p)^{n-1} = np.$$    (2.45)
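
The closed form $m = np$ can be compared against a direct evaluation of the defining sum. The following Python sketch does this in exact rational arithmetic for a few parameter pairs; the function name `binomial_mean` and the particular values of $n$ and $p$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Direct evaluation of sum_k k * C(n, k) p^k (1 - p)^(n - k)."""
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

# Compare the direct sum with the closed form n * p.
checks = [(n, p, binomial_mean(n, p) == n * p)
          for n in (1, 4, 10)
          for p in (Fraction(1, 2), Fraction(1, 5))]

print(checks)
```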

The second moment is messier, so its evaluation is postponed until simpler 
means are developed. 

The geometric pmf is handled using the geometric progression, usually 
treated in high school algebra and summarized in appendix B. From (B.4) 
in appendix B we have for any real a with |a| < 1 



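$$\sum_{k=0}^{\infty} a^k = \frac{1}{1 - a}.$$    (2.46)

Setting $a = 1 - p$ gives $\sum_{k=1}^{\infty} p (1 - p)^{k-1} = p \sum_{k=0}^{\infty} (1 - p)^k = p/p = 1$,
which proves that the geometric pmf indeed sums to 1.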

Evaluation of the mean of the geometric pmf requires evaluation of the
sum

$$m = \sum_{k=1}^{\infty} k\, p(k) = \sum_{k=1}^{\infty} k\, p (1 - p)^{k-1}.$$

One may have access to a book of tables including this sum, but a
useful trick can be used to evaluate the sum from the well-known result for
summing a geometric series. The trick involves differentiating the usual
geometric progression sum, as detailed in appendix B, where it is shown
for any $q \in (0, 1)$ that

$$\sum_{k=0}^{\infty} k q^{k-1} = \frac{1}{(1 - q)^2}.$$    (2.47)

Setting $q = 1 - p$ yields

$$m = \frac{1}{p}.$$    (2.48)



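A similar idea works for the second moment. From (B.7) of appendix B
the second moment is given by

$$m^{(2)} = \sum_{k=1}^{\infty} k^2 p (1 - p)^{k-1} = p \left( \frac{2(1 - p)}{p^3} + \frac{1}{p^2} \right) = \frac{2 - p}{p^2}$$    (2.49)

and hence from (2.41) the variance is

$$\sigma^2 = \frac{1 - p}{p^2}.$$    (2.50)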



As an example of a probability computation using a geometric pmf,
suppose that $(\Omega, \mathcal{F}, P)$ is a discrete probability space with $\Omega = \{1, 2, 3, \ldots\}$,
$\mathcal{F}$ the power set of $\Omega$, and $P$ the probability measure induced by the geometric
pmf with parameter $p$. Find the probabilities of the events $F = \{k : k \ge 10\}$
and $G = \{k : k \text{ is odd}\}$. Alternatively note that $F = \{10, 11, 12, \ldots\}$ and
$G = \{1, 3, 5, 7, \ldots\}$ (we consider only odd numbers in the sample space,
that is, only positive odd numbers). We have that

$$\begin{aligned}
P(F) &= \sum_{k \in F} p(k) \\
&= \sum_{k=10}^{\infty} p (1 - p)^{k-1} \\
&= \frac{p}{1 - p} \sum_{k=10}^{\infty} (1 - p)^k \\
&= \frac{p}{1 - p} (1 - p)^{10} \sum_{k=0}^{\infty} (1 - p)^k \\
&= p (1 - p)^9 \, \frac{1}{1 - (1 - p)} \\
&= (1 - p)^9,
\end{aligned}$$

where the suitable form of the geometric progression has been derived from
the basic form (B.4). While we have concentrated on the calculus, this
problem could be interpreted as a solution to a word problem. For example,
suppose you arrive at the Stanford Post Office and you know that the
probability of $k$ people being in line is a geometric distribution with $p = 1/2$.
What is the probability that there are at least ten people in line? From the
solution just obtained the answer is $(1 - .5)^9 = 2^{-9}$.

To find the probability of an odd outcome, we proceed in the same
general fashion to write

$$\begin{aligned}
P(G) &= \sum_{k \in G} p(k) \\
&= p \sum_{k = 0, 2, 4, \ldots} (1 - p)^k \\
&= p \sum_{k=0}^{\infty} \left[ (1 - p)^2 \right]^k \\
&= \frac{p}{1 - (1 - p)^2} = \frac{1}{2 - p}.
\end{aligned}$$

Thus in the English example of the post office lines, the probability of
finding an odd number of people in line is 2/3.
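
Both answers for the post office example can be confirmed by summing the geometric pmf numerically. The Python sketch below truncates the infinite sums at a large index, which is an approximation; the truncation point `N` and the function name `geometric_pmf` are assumptions made for this illustration.

```python
def geometric_pmf(k, p):
    """p(k) = (1 - p)^(k - 1) p for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.5
N = 2000  # truncation point; the neglected tail is negligible for p = 0.5

# P(at least ten people in line): sum over k = 10, 11, ...
p_at_least_10 = sum(geometric_pmf(k, p) for k in range(10, N))

# P(odd number of people in line): sum over k = 1, 3, 5, ...
p_odd = sum(geometric_pmf(k, p) for k in range(1, N, 2))

print(p_at_least_10, (1 - p) ** 9)
print(p_odd, 1 / (2 - p))
```

Each printed pair compares the truncated sum with the corresponding closed form, $(1 - p)^9$ and $1/(2 - p)$.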
Lastly we consider the Poisson pmf, again beginning with a verification
that it is indeed a pmf. Consider the sum

$$\sum_{k=0}^{\infty} p(k) = \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}.$$

Here the trick is to recognize the sum as the Taylor series expansion for an
exponential, that is,

$$e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!},$$

whence

$$\sum_{k=0}^{\infty} p(k) = e^{-\lambda} e^{\lambda} = 1,$$

proving the claim.

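To evaluate the mean of the Poisson pmf, begin with

$$\sum_{k=0}^{\infty} k\, p(k) = \sum_{k=0}^{\infty} k \frac{\lambda^k e^{-\lambda}}{k!} = \sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda}}{(k - 1)!}.$$

Change variables $l = k - 1$ and pull a $\lambda$ out of the sum to write

$$m = \lambda e^{-\lambda} \sum_{l=0}^{\infty} \frac{\lambda^l}{l!}.$$

Recognizing the sum as $e^{\lambda}$, this yields

$$m = \lambda.$$    (2.51)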



The second moment is found similarly, but with more bookkeeping.
Analogous to the mean computation,

$$m^{(2)} = \sum_{k=0}^{\infty} k^2 \frac{\lambda^k e^{-\lambda}}{k!} = \sum_{k=1}^{\infty} k \frac{\lambda^k e^{-\lambda}}{(k - 1)!}.$$

Change variables $l = k - 1$ and pull $\lambda$ out of the sum to obtain

$$m^{(2)} = \lambda \sum_{l=0}^{\infty} (l + 1) \frac{\lambda^l e^{-\lambda}}{l!} = \lambda (m + 1) = \lambda^2 + \lambda$$    (2.52)

so that from (2.41) the variance is

$$\sigma^2 = \lambda.$$    (2.53)
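
The Poisson results (normalization, mean $\lambda$, second moment $\lambda^2 + \lambda$, and variance $\lambda$) can be checked numerically by truncating the infinite sums. The sketch below is a hedged illustration: the parameter value `lam = 2.5`, the truncation point `N`, and the function name `poisson_pmf` are assumptions, not values from the text.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(k) = lam^k e^(-lam) / k! for k = 0, 1, 2, ..."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.5  # illustrative parameter value
N = 100    # truncation point; later terms are negligible for this lam

total = sum(poisson_pmf(k, lam) for k in range(N))
mean = sum(k * poisson_pmf(k, lam) for k in range(N))
second_moment = sum(k ** 2 * poisson_pmf(k, lam) for k in range(N))
variance = second_moment - mean ** 2

print(total, mean, second_moment, variance)
```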



Multidimensional pmf’s 

While the foregoing ideas were developed for scalar sample spaces such as
$\mathcal{Z}_+$, they also apply to vector sample spaces. For example, if $A$ is a discrete
space, then so is the vector space $A^k = \{$all vectors $\mathbf{x} = (x_0, \ldots, x_{k-1})$ with
$x_i \in A$, $i = 0, 1, \ldots, k - 1\}$. A common example of a pmf on vectors is the
product pmf of the following example.

[2.15] The product pmf.

Let $p_i$; $i = 0, 1, \ldots, k - 1$, be a collection of one-dimensional pmf’s;
that is, for each $i = 0, 1, \ldots, k - 1$, $p_i(r)$; $r \in A$ satisfies (2.30) and
(2.31). Define the product $k$-dimensional pmf $p$ on $A^k$ by

$$p(\mathbf{x}) = p(x_0, x_1, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} p_i(x_i).$$

As a more specific example, suppose that all of the marginal pmf’s are
the same and are given by a Bernoulli pmf:

$$p(x) = p^x (1 - p)^{1-x}; \quad x = 0, 1.$$

Then the corresponding product pmf for a $k$-dimensional vector becomes

$$\begin{aligned}
p(x_0, x_1, \ldots, x_{k-1}) &= \prod_{i=0}^{k-1} p^{x_i} (1 - p)^{1 - x_i} \\
&= p^{w(x_0, x_1, \ldots, x_{k-1})} (1 - p)^{k - w(x_0, x_1, \ldots, x_{k-1})},
\end{aligned}$$

where $w(x_0, x_1, \ldots, x_{k-1})$ is the number of ones occurring in the binary
$k$-tuple $x_0, x_1, \ldots, x_{k-1}$, the Hamming weight of the vector.
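
The product of identical Bernoulli marginals can be enumerated exhaustively for a small dimension, which confirms both that the product pmf sums to one and that it depends on the vector only through its Hamming weight. In the Python sketch below, the values $p = 1/3$ and $k = 4$ and the function names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

def bernoulli_pmf(x, p):
    """p(x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def product_pmf(xs, p):
    """Product of identical Bernoulli marginals over the coordinates of xs."""
    result = Fraction(1)
    for x in xs:
        result *= bernoulli_pmf(x, p)
    return result

p, k = Fraction(1, 3), 4  # illustrative parameter values
vectors = list(product((0, 1), repeat=k))

total = sum(product_pmf(xs, p) for xs in vectors)

# Closed form in terms of the Hamming weight w = number of ones in xs.
matches = all(
    product_pmf(xs, p) == p ** sum(xs) * (1 - p) ** (k - sum(xs))
    for xs in vectors
)

print(total, matches)
```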
2.5 Continuous Probability Spaces

Continuous spaces are handled in a manner analogous to discrete spaces, 
but with some fundamental differences. The primary difference is that 
usually probabilities are computed by integrating a density function instead 
of summing a mass function. The good news is that most formulas look 
the same with integrals replacing sums. The bad news is that there are 
some underlying theoretical issues that require consideration. The problem 
is that integrals are themselves limits, and limits do not always exist in the 
sense of converging to a finite number. Because of this, some care will be 
needed to clarify when the resulting probabilities are well defined. 

[2.14] Let $(\Omega, \mathcal{F}) = (\Re, \mathcal{B}(\Re))$, the real line together with its Borel field.

Suppose that we have a real-valued function $f$ on the real line that
satisfies the following properties

$$f(r) \ge 0, \quad \text{all } r \in \Omega,$$    (2.54)

$$\int_{\Omega} f(r)\, dr = 1,$$    (2.55)

that is, the function $f(r)$ has a well-defined integral over the real line.
Define the set function $P$ by

$$P(F) = \int_F f(r)\, dr = \int 1_F(r) f(r)\, dr, \quad F \in \mathcal{B}(\Re).$$    (2.56)



We note that a probability space defined as a probability measure on a 
Borel field is an example of a Borel space. 

Again as in the discrete case, this integral is a special case of a more
general weighted integral: Suppose that $g$ is a real-valued function defined
on $\Omega$, i.e., $g : \Omega \to \Re$ assigns a real number $g(r)$ to every $r \in \Omega$. Recall
that such a function is called a random variable. Given a pdf $f$, define the
expectation of $g$ (with respect to $f$) as

$$E(g) = \int g(r) f(r)\, dr.$$    (2.57)

With this definition we can rewrite (2.56) as

$$P(F) = E(1_F),$$    (2.58)

which has exactly the same form as in the discrete case. Thus probabilities 
can be considered as expectations of indicator functions in both the dis- 
crete case where the probability measure is described by a pmf and in the 
continuous case if the probability measure is described by a pdf. 
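
As a rough numerical illustration of (2.57) and (2.58), the sketch below approximates $E(g)$ by a midpoint Riemann sum and recovers $P(F)$ as the expectation of an indicator function. The exponential pdf with rate `lam = 2`, the event $F = [0.5, 1.5)$, the truncation of the real line to $[0, 20]$, and the helper names `expectation` and `indicator` are all assumptions made for this example only.

```python
import math

def expectation(g, f, lo, hi, n=200_000):
    """Midpoint-rule approximation of E(g) = integral of g(r) f(r) dr on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        r = lo + (i + 0.5) * h
        total += g(r) * f(r)
    return total * h

def indicator(a, b):
    """Indicator function 1_F of the interval F = [a, b)."""
    return lambda r: 1.0 if a <= r < b else 0.0

# Assumed example pdf: exponential with rate lam (zero for negative r).
lam = 2.0
f = lambda r: lam * math.exp(-lam * r) if r >= 0 else 0.0

# P(F) = E(1_F) for F = [0.5, 1.5); the tail beyond 20 is negligible here.
p_numeric = expectation(indicator(0.5, 1.5), f, 0.0, 20.0)

# Closed form obtained by integrating the exponential pdf directly.
p_exact = math.exp(-lam * 0.5) - math.exp(-lam * 1.5)
```

The same `expectation` helper can be reused with $g(r) = r$ or $g(r) = r^2$ to approximate moments, which is the point of writing probabilities as expectations.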
As in the discrete case, there are several particularly important examples
of expectations if the sample space is a subset of the real line, e.g., $\Re$ or
$[0,1)$. The definitions are exact integral analogs of those for the discrete
case: the mean or first moment

$$m = \int r f(r)\,dr, \tag{2.59}$$

the $k$th moment

$$m^{(k)} = \int r^k f(r)\,dr, \tag{2.60}$$

including the second moment,

$$m^{(2)} = \int r^2 f(r)\,dr, \tag{2.61}$$

the centralized moments formed by subtracting the mean before taking the
power:

$$\int (r - m)^k f(r)\,dr, \tag{2.62}$$

including the variance

$$\sigma^2 = \int (r - m)^2 f(r)\,dr. \tag{2.63}$$

Often the $k$th absolute moment is used instead:

$$m_a^{(k)} = \int |r|^k f(r)\,dr. \tag{2.64}$$

As in the discrete case, the variance and the second moment are easily
related as

$$\sigma^2 = m^{(2)} - m^2. \tag{2.65}$$
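
The relation (2.65) can be checked numerically. The following is a minimal sketch, assuming the illustrative pdf $f(r) = 2r$ on $[0, 1]$ (not one of the pdf's discussed in the text) and a hypothetical helper `integrate` implementing the midpoint rule. For this pdf the exact values are $m = 2/3$, $m^{(2)} = 1/2$, and $\sigma^2 = 1/18$.

```python
def integrate(h, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return sum(h(lo + (i + 0.5) * step) for i in range(n)) * step

# Assumed example pdf on [0, 1]: nonnegative and integrates to 1.
f = lambda r: 2.0 * r

mean = integrate(lambda r: r * f(r), 0.0, 1.0)             # (2.59)
second = integrate(lambda r: r ** 2 * f(r), 0.0, 1.0)      # (2.61)
var_central = integrate(lambda r: (r - mean) ** 2 * f(r), 0.0, 1.0)  # (2.63)
var_identity = second - mean ** 2                          # (2.65)
```

Both routes to the variance agree up to the discretization error of the midpoint rule.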

An important technical detail not yet considered is whether or not the 
set function defined as an integral over a pdf is actually a probability mea- 
sure. In particular, are the probabilities of all events well defined and do 
they satisfy the axioms of probability? Intuitively this should be the case 
since (2.54) to (2.56) are the integral analogs of the summations of (2.30) 
to (2.32) and we have argued that summing pmf’s provides a well-defined 
probability measure. In fact, this is mathematically a delicate issue which 
leads to the reasons behind the requirements for sigma-fields and Borel 
fields. Before exploring these issues in more depth in the next section, the 
easy portion of the answer should be recalled: We have already argued in
the introduction to this chapter that if we define a set function $P(F)$ as the
integral of a pdf over the set F, then if the integral exists for the sets in 
question, the set function must be nonnegative, normalized, and additive, 
that is, it must satisfy the first three axioms of probability. This is well 
and good, but it leaves some key points unanswered. First, is the candi- 
date probability measure defined for all Borel sets? I.e., are we guaranteed 
that the integral will make sense for all sets (events) of interest? Second, 
is the candidate probability measure also countably additive or, equiva- 
lently, continuous from above or below? The answer to both questions is 
unfortunately no if one considers the integral to be a Riemann integral, the 
integral most engineers learn as undergraduates. The integral is not certain 
to exist for all Borel sets, even if the pdf is a simple uniform pdf. Riemann 
integrals in general do not have nice limiting properties, so the necessary 
continuity properties do not hold in general for Riemann integrals. These
delicate issues are considered next in an optional subsection and further in 
appendix B, but the bottom line can be easily summarized as follows. 

• Eq. (2.56) defines a probability measure on the Borel space of the 
real line and its Borel sets provided that the integral is interpreted as 
a Lebesgue integral. In all practical cases of interest, the Lebesgue 
integral is either equal to the Riemann integral, usually more famil- 
iar to engineers, or to a limit of Riemann integrals of a converging 
sequence of sets. 

*Probabilities as Integrals

The first issue is fundamental: Does the integral of (2.56) make sense; i.e., 
is it well-defined for all events of interest? Suppose first that we take the 
common engineering approach and use Riemann integration — the form 
of integration used in elementary calculus. Then the above integrals are 
defined at least for events F that are intervals. This implies from the 
linearity properties of Riemann integration that the integrals are also well- 
defined for events F that are finite unions of intervals. It is not difficult, 
however, to construct sets $F$ for which the indicator function $1_F$ is so nasty
that the function $f(r)1_F(r)$ does not have a Riemann integral. For example,
suppose that $f(r)$ is 1 for $r \in [0, 1]$ and 0 otherwise. Then the Riemann
integral $\int 1_F(r) f(r)\,dr$ is not defined for the set $F$ of all irrational numbers,
yet intuition should suggest that the set has probability 1. This intuition
reflects the fact that if all points are somehow equally probable, then since
the unit interval contains an uncountable infinity of irrational numbers and
only a countable infinity of rational numbers, the probability of the
former set should be one and that of the latter 0. This intuition is not
reflected in the integral definition, which is not defined for either set by the
Riemann approach. Thus the definition of (2.56) has a basic problem: The
integral in the formula giving the probability measure of a set might not
be well-defined.

A natural approach to escaping this dilemma would be to use the Rie- 
mann integral when possible, i.e., to define the probabilities of events that 
are finite unions of intervals, and then to obtain the probabilities of more 
complicated events by expressing them as a limit of finite unions of inter- 
vals, if the limit makes sense. This would hopefully give us a reasonable 
definition of a probability measure on a class of events much larger than the
class of all finite unions of intervals. Intuitively, it should give us a proba- 
bility measure of all sets that can be expressed as increasing or decreasing 
limits of finite unions of intervals. 

This larger class is, in fact, the Borel field, but the Riemann integral
has the unfortunate property that in general we cannot interchange limits 
and integration; that is, the limit of a sequence of integrals of converging 
functions may not be itself an integral of a limiting function. 

This problem is so important to the development of a rigorous proba- 
bility theory that it merits additional emphasis: even though the familiar 
Riemann integrals of elementary calculus suffice for most engineering and 
computational purposes, they are too weak for building a useful theory, 
proving theorems, and evaluating the probabilities of some events which 
can be most easily expressed as limits of simple events. The problems are 
that the Riemann integral does not exist for sufficiently general functions 
and that limits and integration cannot be interchanged in general. 

The solution is to use a different definition of integration: the Lebesgue
integral. Here we need only concern ourselves with a few simple properties 
of the Lebesgue integral, which are summarized below. The interested 
reader is referred to appendix B for a brief summary of basic definitions and 
properties of the Lebesgue integral which reinforce the following remarks. 

The Riemann integral of a function $f(r)$ “carves up” or partitions the
domain of the argument $r$ and effectively considers weighted sums of the
values of the function $f(r)$ as the partition becomes ever finer. Conversely,
the Lebesgue integral “carves up” the values of the function itself and
effectively defines an integral as a limit of simple integrals of quantized
versions of the function. This simple change of definition results in two
fundamentally important properties of Lebesgue integrals that are not
possessed by Riemann integrals:

1. The integral is defined for all Borel sets.

2. Subject to suitable technical conditions (such as integrands with bounded
absolute value), one can interchange the order of limits and integration;
e.g., if $F_n \uparrow F$, then

$$P(F) = \int 1_F(r) f(r)\,dr = \int \lim_{n \to \infty} 1_{F_n}(r) f(r)\,dr = \lim_{n \to \infty} \int 1_{F_n}(r) f(r)\,dr = \lim_{n \to \infty} P(F_n),$$

that is, (2.28) holds, and hence the set function is continuous from
below.



We have already seen that if the integral exists, then (2.56) ensures that 
the first three axioms hold. Thus the existence of the Lebesgue integral on 
all Borel sets coupled with continuity and the first three axioms ensures 
that a set function defined in this way is indeed a probability measure. 
We observe in passing that even if we confined interest to events for which 
the Riemann integral made sense, it would not follow that the resulting 
probability measure would be countably additive: As with continuity, these 
asymptotic properties hold for Lebesgue integration but not for Riemann 
integration. 

How do we reconcile the use of a Lebesgue integral given the assumed 
prerequisite of traditional engineering calculus courses based on the Rie- 
mann integral? Here a standard result of real analysis comes to our aid: If 
the ordinary Riemann integral exists, then so does the Lebesgue integral, 
and the two are the same. If the Riemann integral does not exist, then we 
can try to find the probability as a limit of probabilities of simple events 
for which the Riemann integrals do exist, e.g., as the limit of probabilities 
of finite unions of intervals. In other words, Riemann calculus will usually
suffice for computation (at least if $f(r)$ is Riemann integrable) provided we
realize that we may have to take limits of Riemann integrals for complicated
events. Observe, for example, that in the case mentioned where $f(r)$
is 1 on $[0, 1]$, the probability of a single point 1/2 can now be found easily
as a limit of Riemann integrals:

$$P(\{1/2\}) = \lim_{\epsilon \to 0} \int 1_{(1/2 - \epsilon,\, 1/2 + \epsilon)}(r)\,dr = \lim_{\epsilon \to 0} 2\epsilon = 0,$$

as expected.

In summary, our engineering compromise is this: We must realize that 
for the theory to be valid and for (2.56) indeed to give a probability measure 
on subsets of the real line, the integral must be interpreted as a Lebesgue 
integral and Riemann integrals may not exist. For computation, however, 
one will almost always be able to find probabilities by either Riemann 
integration or by taking limits of Riemann integrals over simple events. 
This distinction between Riemann integrals for computation and Lebesgue 
integrals for theory is analogous to the distinction between rational numbers
and real numbers. Computational and engineering tasks use only arithmetic
of finite precision in practice. However, in developing the theory irrational
numbers such as $\sqrt{2}$ and $\pi$ are essential. Imagine how hard it would be
to develop a theory without using irrational numbers, and how unwise it 
would be to do so just because the eventual computations do not use them. 
So it is with Lebesgue integrals. 

Probability Density Functions 

The function $f$ used in (2.54) to (2.56) is called a probability density function
or pdf since it is a nonnegative function that is integrated to find a total
mass of probability, just as a mass density function in physics is integrated
to find a total mass. Like a pmf, a pdf is defined only for points in $\Omega$ and
not for sets. Unlike a pmf, a pdf is not in itself the probability of anything;
for example, a pdf can take on values greater than one, while a pmf cannot.
Under a pdf, points frequently have probability zero, even though the pdf
is nonzero. We can, however, interpret a pdf as being proportional to a
probability in the following sense. For a pmf we had

$$p(x) = P(\{x\}).$$

Suppose now that the sample space is the real line and that a pdf $f$ is
defined. Let $F = [x, x + \Delta x)$, where $\Delta x$ is extremely small. Then if $f$ is
sufficiently smooth, the mean value theorem of calculus implies that

$$P([x, x + \Delta x)) = \int_x^{x + \Delta x} f(\alpha)\,d\alpha \approx f(x)\Delta x. \tag{2.66}$$

Thus if a pdf $f(x)$ is multiplied by a differential $\Delta x$, it can be interpreted
as (approximately) the probability of being within $\Delta x$ of $x$.
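
The approximation (2.66) can be seen numerically. The sketch below assumes an exponential pdf with rate `lam = 1.5` (chosen only for illustration) and compares the exact probability of $[x, x + \Delta x)$, obtained by integrating this pdf in closed form, with $f(x)\Delta x$ for shrinking $\Delta x$. The names `prob_interval` and `results` are hypothetical.

```python
import math

lam = 1.5  # assumed rate parameter for the illustration

def f(r):
    """Exponential pdf; note f(0) = 1.5 > 1, which a pmf could never do."""
    return lam * math.exp(-lam * r) if r >= 0 else 0.0

def prob_interval(x, dx):
    """Exact P([x, x + dx)) for x >= 0, from integrating f in closed form."""
    return math.exp(-lam * x) - math.exp(-lam * (x + dx))

x = 0.4
results = []
for dx in (0.1, 0.01, 0.001):
    exact = prob_interval(x, dx)
    approx = f(x) * dx                       # right-hand side of (2.66)
    rel_err = abs(exact - approx) / exact
    results.append((dx, exact, approx, rel_err))
```

For this pdf the relative error shrinks roughly in proportion to $\Delta x$, which is the sense in which $f(x)\Delta x$ is "approximately" a probability.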

Both probability functions, the pmf and the pdf, can be used to define 
and compute a probability measure: The pmf is summed over all points 
in the event, and the pdf is integrated over all points in the event. If the
sample space is a subset of the real line, both can be used to compute
expectations such as moments.

Some of the most common pdf’s are listed below. As will be seen, these
are indeed valid pdf’s, that is, they satisfy (2.54) and (2.55). The pdf’s are
assumed to be 0 outside of the specified domain. $b$, $a$, $\lambda > 0$, $m$, and $\sigma > 0$
are parameters in $\Re$.

The uniform pdf. Given $b > a$, $f(r) = 1/(b - a)$ for $r \in [a, b]$.

The exponential pdf. $f(r) = \lambda e^{-\lambda r}$; $r \ge 0$.

The doubly exponential (or Laplacian) pdf. $f(r) = \frac{\lambda}{2} e^{-\lambda |r|}$; $r \in \Re$.

The Gaussian (or Normal) pdf. $f(r) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(r - m)^2}{2\sigma^2}\right)$;
$r \in \Re$. Since the density is completely described by two parameters, the
mean $m$ and variance $\sigma^2 > 0$, it is common to denote it by $\mathcal{N}(m, \sigma^2)$.



Other univariate pdf’s may be found in Appendix C. 

Just as we used a pdf to construct a probability measure on the space
$(\Re, \mathcal{B}(\Re))$, we can also use it to define a probability measure on any smaller
space $(A, \mathcal{B}(A))$, where $A$ is a subset of $\Re$.

As a technical detail we note that to ensure that the integrals all behave
as expected we must also require that $A$ itself be a Borel set of $\Re$ so that
it is precluded from being too nasty a set. Such probability spaces can be
considered to have a sample space of either $\Re$ or $A$, as convenient. In the
former case events outside of $A$ will have zero probability.



Computational Examples 

This section is less detailed than its counterpart for discrete probability 
because generally engineers are more familiar with common integrals than 
with common sums. We confine the discussion to a few observations and 
to an example of a multidimensional probability computation. 

The uniform pdf is trivially a valid pdf because it is nonnegative and
its integral is simply the length of the interval on which it is nonzero,
$b - a$, divided by the length. For simplicity consider the case where $a = 0$
and $b = 1$ so that $b - a = 1$. In this case the probability of any interval
within $[0, 1)$ is simply the length of the interval. The mean is easily found
to be

$$m = \int_0^1 r\,dr = \frac{r^2}{2}\Big|_0^1 = \frac{1}{2}, \tag{2.67}$$

the second moment is

$$m^{(2)} = \int_0^1 r^2\,dr = \frac{r^3}{3}\Big|_0^1 = \frac{1}{3}, \tag{2.68}$$

and the variance is

$$\sigma^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}. \tag{2.69}$$
The validation of the pdf and the mean, second moment, and variance
of the exponential pdf can be found from integral tables or by the integral
analog to the corresponding computations for the geometric pmf, as
described in appendix B. In particular, it follows from (eq:expint) that

$$\int_0^\infty \lambda e^{-\lambda r}\,dr = 1, \tag{2.70}$$

from (B.10) that

$$m = \int_0^\infty r \lambda e^{-\lambda r}\,dr = \frac{1}{\lambda} \tag{2.71}$$

and

$$m^{(2)} = \int_0^\infty r^2 \lambda e^{-\lambda r}\,dr = \frac{2}{\lambda^2}, \tag{2.72}$$

and hence from (2.65)

$$\sigma^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}. \tag{2.73}$$



The moments can also be found by integration by parts. 
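
A quick sanity check of (2.71) to (2.73) is to compare them with sample averages. The sketch below is a simulation, not a derivation: it assumes `lam = 2`, a fixed seed, and a hypothetical helper `sample_moments`, and it relies on the standard library's `random.Random.expovariate` to draw exponential samples.

```python
import random

def sample_moments(lam, n=200_000, seed=0):
    """Estimate mean, second moment, and variance of an exponential pdf
    with rate lam from n pseudo-random samples."""
    rng = random.Random(seed)
    xs = [rng.expovariate(lam) for _ in range(n)]
    mean = sum(xs) / n
    second = sum(x * x for x in xs) / n
    return mean, second, second - mean ** 2   # variance via (2.65)

lam = 2.0
mean_hat, second_hat, var_hat = sample_moments(lam)

# Closed-form values from (2.71), (2.72), and (2.73).
mean_exact, second_exact, var_exact = 1 / lam, 2 / lam ** 2, 1 / lam ** 2
```

The estimates fluctuate with the seed and sample size, so they should be read as agreeing with the closed forms only up to sampling error.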

The Laplacian pdf is simply a mixture of an exponential pdf and its
reverse, so its properties follow from those of an exponential pdf. The 
details are left as an exercise. 

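The Gaussian pdf example is more involved. In appendix B, it is shown
(in the development leading up to (B.15)) that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = 1. \tag{2.74}$$

It is reasonably easy to find the mean by inspection. The function
$g(x) = (x - m) e^{-\frac{(x - m)^2}{2\sigma^2}}$ is an odd function about $x = m$, i.e., it has the
form $g(m - u) = -g(m + u)$, and hence its integral is 0 if the integral
exists at all. This means that

$$\int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = m. \tag{2.75}$$

The second moment and variance are most easily handled by the transform
methods to be developed in Chapter 4 and their evaluation will be deferred
until then, but we observe that the parameter $\sigma^2$ which we have called the
variance is in fact the variance, i.e.,

$$\int_{-\infty}^{\infty} \frac{(x - m)^2}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = \sigma^2. \tag{2.76}$$
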
Computing probabilities with the various pdf’s varies in difficulty. For
simple pdf’s one can easily find the probabilities of simple sets like intervals.
For example, with a uniform pdf on $[a, b]$, for any $a \le c < d \le b$,
$\Pr([c, d]) = (d - c)/(b - a)$; the probability of an interval is proportional to
the length of the interval. For the exponential pdf, the probability of an
interval $[c, d]$, $0 \le c < d$, is given by

$$\Pr([c, d]) = \int_c^d \lambda e^{-\lambda x}\,dx = e^{-\lambda c} - e^{-\lambda d}. \tag{2.77}$$

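The Gaussian pdf does not yield nice closed form solutions for the
probabilities of simple sets like intervals, but it is well tabulated. Unfortunately
there are several variations of how these tables are constructed. The most
common forms are the $\Phi$ function

$$\Phi(\alpha) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\alpha} e^{-\frac{u^2}{2}}\,du, \tag{2.78}$$

which is the probability of the simple event $(-\infty, \alpha] = \{x : x \le \alpha\}$ for
a zero mean unit variance Gaussian pdf $\mathcal{N}(0, 1)$. The $Q$ function is the
complementary function

$$Q(\alpha) = \frac{1}{\sqrt{2\pi}} \int_{\alpha}^{\infty} e^{-\frac{u^2}{2}}\,du = 1 - \Phi(\alpha). \tag{2.79}$$

The $Q$ function is used primarily in communications systems analysis where
probabilities of exceeding a threshold describe error events in detection
systems. The error function is defined by

$$\operatorname{erf}(\alpha) = \frac{2}{\sqrt{\pi}} \int_0^{\alpha} e^{-u^2}\,du \tag{2.80}$$

and it is related to the $Q$ and $\Phi$ functions by

$$Q(\alpha) = \frac{1}{2}\left(1 - \operatorname{erf}\left(\frac{\alpha}{\sqrt{2}}\right)\right) = 1 - \Phi(\alpha). \tag{2.81}$$

Thus, for example, the probability of the set $(-\infty, \alpha]$ for a $\mathcal{N}(m, \sigma^2)$
pdf is found by changing variables $u = (x - m)/\sigma$ to be

$$P(\{x : x \le \alpha\}) = \int_{-\infty}^{\alpha} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = \int_{-\infty}^{\frac{\alpha - m}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}}\,du = \Phi\left(\frac{\alpha - m}{\sigma}\right) = 1 - Q\left(\frac{\alpha - m}{\sigma}\right). \tag{2.82}$$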
The probability of an interval $(a, b]$ is then given by

$$P((a, b]) = P((-\infty, b]) - P((-\infty, a]) = \Phi\left(\frac{b - m}{\sigma}\right) - \Phi\left(\frac{a - m}{\sigma}\right). \tag{2.83}$$

Observe that the symmetry of a Gaussian density implies that

$$1 - \Phi(a) = \Phi(-a). \tag{2.84}$$
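
In software the tables are usually replaced by an error-function routine. The sketch below uses Python's standard `math.erf` and `math.erfc` together with (2.81) to build $\Phi$ and $Q$, and then (2.83) for interval probabilities. The function names `Phi`, `Q`, and `gaussian_interval_prob` are choices made for this example, not notation from a particular library.

```python
import math

def Phi(alpha):
    """Phi(alpha) of (2.78), via Phi = 1 - Q and (2.81)."""
    return 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))

def Q(alpha):
    """Q(alpha) of (2.79); erfc(z) = 1 - erf(z), so this is (2.81)."""
    return 0.5 * math.erfc(alpha / math.sqrt(2.0))

def gaussian_interval_prob(a, b, m, sigma):
    """P((a, b]) for a N(m, sigma^2) pdf, following (2.83)."""
    return Phi((b - m) / sigma) - Phi((a - m) / sigma)

# Probability of falling within one standard deviation of the mean,
# for an arbitrarily chosen N(3, 4) pdf (m = 3, sigma = 2).
p_one_sigma = gaussian_interval_prob(1.0, 5.0, m=3.0, sigma=2.0)
```

Using `erfc` for $Q$ rather than `1 - Phi` avoids losing precision when the threshold is large and $Q$ is tiny, which is the regime of interest for error probabilities.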



As a multidimensional example of probability computation, suppose
that the sample space is $\Re^2$, the space of all pairs of real numbers. The
probability space consists of this sample space, the corresponding Borel
field, and a probability measure described by a pdf

$$f(x, y) = \begin{cases} \lambda\mu e^{-\lambda x - \mu y}, & x \in [0, \infty),\ y \in [0, \infty) \\ 0, & \text{otherwise.} \end{cases}$$

What is the probability of the event $F = \{(x, y) : x < y\}$? As an
interpretation, the sample points $(x, y)$ might correspond to the arrival times of
two distinct types of particle at a sensor following its activation, say type
A and type B for $x$ and $y$, respectively. Then the event is the event that a
particle of type A arrives at the sensor before one of type B. Computation
of the probability is then accomplished as

$$P(F) = \iint_{(x, y) : (x, y) \in F} f(x, y)\,dx\,dy = \iint_{(x, y) : x \ge 0,\ y \ge 0,\ x < y} \lambda\mu e^{-\lambda x} e^{-\mu y}\,dx\,dy.$$



This integral is a two-dimensional integral of its argument over the indicated
region. Correctly describing the limits of integration is often the hardest
part of computing probabilities. Note in particular the inclusion of the facts
that both $x$ and $y$ are nonnegative (since otherwise the pdf is 0). The $x < y$
region for nonnegative $x$ and $y$ is most easily envisioned as the region of
the first quadrant lying above the line $x = y$, if $x$ and $y$ correspond to the
horizontal and vertical axes, respectively. Completing the calculus:

$$\begin{aligned}
P(F) &= \int_0^\infty dy \left( \int_0^y dx\, \lambda\mu e^{-\lambda x} e^{-\mu y} \right) \\
&= \lambda\mu \int_0^\infty dy\, e^{-\mu y} \int_0^y dx\, e^{-\lambda x} \\
&= \lambda\mu \int_0^\infty dy\, e^{-\mu y}\, \frac{1}{\lambda}\left(1 - e^{-\lambda y}\right) \\
&= \mu \int_0^\infty dy\, e^{-\mu y} - \mu \int_0^\infty dy\, e^{-(\mu + \lambda) y} \\
&= 1 - \frac{\mu}{\mu + \lambda} = \frac{\lambda}{\mu + \lambda}.
\end{aligned}$$
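
The iterated integral above can be mirrored numerically: an inner integral over $x$ from 0 to $y$, then an outer integral over $y$. The sketch below is only a check under assumed values `lam = 2` and `mu = 3`; the helper `midpoint`, the truncation point `y_max`, and the step counts are illustrative choices.

```python
import math

def midpoint(h, lo, hi, n):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return sum(h(lo + (i + 0.5) * step) for i in range(n)) * step

lam, mu = 2.0, 3.0  # assumed parameter values

def joint_pdf(x, y):
    """f(x, y) = lam * mu * exp(-lam x - mu y) on the first quadrant."""
    if x < 0 or y < 0:
        return 0.0
    return lam * mu * math.exp(-lam * x - mu * y)

def inner(y):
    """Integral over x of f(x, y) restricted to the event x < y."""
    return midpoint(lambda x: joint_pdf(x, y), 0.0, y, 200)

y_max = 15.0  # the tail beyond y_max is negligible for mu = 3
p_numeric = midpoint(inner, 0.0, y_max, 1500)
p_closed = lam / (lam + mu)
```

The limits `0.0, y` in `inner` are exactly the "hardest part" mentioned above: they encode both the nonnegativity of $x$ and the constraint $x < y$.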

Mass Functions as Densities 



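As in systems theory, discrete problems can be considered as continuous
problems with the aid of the Dirac delta or unit impulse $\delta(t)$, a
generalized function or singularity function (also, unfortunately, called a
distribution) with the property that for any smooth function $\{g(r);\ r \in \Re\}$ and
any $a \in \Re$

$$\int g(r)\delta(r - a)\,dr = g(a). \tag{2.85}$$

Given a pmf $p$ defined on a subset of the real line $\Omega \subset \Re$, we can define a
pdf $f$ by

$$f(r) = \sum_{\omega \in \Omega} p(\omega)\delta(r - \omega). \tag{2.86}$$

This is indeed a pdf since

$$\int f(r)\,dr = \int \left( \sum_{\omega \in \Omega} p(\omega)\delta(r - \omega) \right) dr = \sum_{\omega \in \Omega} p(\omega) \int \delta(r - \omega)\,dr = \sum_{\omega \in \Omega} p(\omega) = 1.$$

In a similar fashion, probabilities are computed as

$$\int 1_F(r) f(r)\,dr = \int 1_F(r) \left( \sum_{\omega \in \Omega} p(\omega)\delta(r - \omega) \right) dr = \sum_{\omega \in \Omega} p(\omega) \int 1_F(r)\delta(r - \omega)\,dr = \sum_{\omega \in \Omega} p(\omega) 1_F(\omega) = P(F).$$
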
Given that discrete probability can be handled using the tools of
continuous probability in this fashion, it is natural to inquire why not use pdf’s
in both the discrete and continuous case. The main reason is simplicity:
pmf’s and sums are usually simpler to handle and evaluate than pdf’s and
integrals. Questions of existence and limits rarely arise, and the notation is 
simpler. In addition, the use of Dirac deltas assumes the theory of gener- 
alized functions in order to treat integrals involving Dirac deltas as if they 
were ordinary integrals, so additional mathematical machinery is required. 
As a result, this approach is rarely used in genuinely discrete problems. 
On the other hand, if one is dealing with a hybrid problem that has both 
discrete and continuous components, then this approach may make sense 
because it allows the use of a single probability function, a pdf, throughout. 

Multidimensional pdf’s 

By considering multidimensional integrals we can also extend the
construction of probabilities by integrals to finite-dimensional product spaces, e.g.,
$\Re^k$.

Given the measurable space $(\Re^k, \mathcal{B}(\Re)^k)$, say we have a real-valued
function $f$ on $\Re^k$ with the properties that

$$f(\mathbf{x}) \ge 0; \quad \text{all } \mathbf{x} = (x_0, x_1, \ldots, x_{k-1}) \in \Re^k, \tag{2.87}$$

$$\int_{\Re^k} f(\mathbf{x})\,d\mathbf{x} = 1. \tag{2.88}$$

Then define a set function $P$ by

$$P(F) = \int_F f(\mathbf{x})\,d\mathbf{x}, \quad \text{all } F \in \mathcal{B}(\Re)^k, \tag{2.89}$$

where the vector integral is shorthand for the $k$-dimensional integral, that
is,

$$P(F) = \int_{(x_0, x_1, \ldots, x_{k-1}) \in F} f(x_0, x_1, \ldots, x_{k-1})\,dx_0\,dx_1 \cdots dx_{k-1}.$$

Note that (2.87) to (2.89) are exact vector equivalents of (2.54) to (2.56). 
As with multidimensional pmf’s, a pdf is not itself the probability of any- 
thing. As in the scalar case, however, the mean value theorem of calculus 
can be used to interpret the pdf as being proportional to the probability of 
being in a very small region around a point, i.e., that 

$$P(\{(\alpha_0, \alpha_1, \ldots, \alpha_{k-1}) : x_i \le \alpha_i < x_i + \Delta_i;\ i = 0, 1, \ldots, k-1\}) \approx f(x_0, x_1, \ldots, x_{k-1})\Delta_0\Delta_1 \cdots \Delta_{k-1}. \tag{2.90}$$

Is $P$ defined by (2.89) a probability measure? The answer is a qualified
yes with exactly the same qualifications as in the one-dimensional case.

As in the one-dimensional sample space, a function $f$ with the above
properties is called a probability density function or pdf. To be more
concise we will occasionally refer to a pdf on $k$-dimensional space as a
$k$-dimensional pdf.

There are two common and important examples of $k$-dimensional pdf’s.
These are defined next. In both examples the dimension $k$ of the sample
space is fixed and the pdf’s induce a probability measure on $(\Re^k, \mathcal{B}(\Re)^k)$
by (2.89).

[2.16] The product pdf.

Let $f_i$; $i = 0, 1, \ldots, k-1$, be a collection of one-dimensional pdf’s;
that is, $f_i(r)$; $r \in \Re$ satisfies (2.54) and (2.55) for each $i = 0, 1, \ldots, k-1$.
Define the product $k$-dimensional pdf $f$ by

$$f(\mathbf{x}) = f(x_0, x_1, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} f_i(x_i).$$

The product pdf in $k$-dimensional space is simply the product of $k$
pdf’s on one-dimensional space. The one-dimensional pdf’s are called the
marginal pdf’s, and the multidimensional pdf is sometimes called a joint
pdf. It is easy to verify that the product pdf integrates to 1.

The case of greatest importance is when all of the marginal pdf’s are
identical, that is, when $f_i(r) = f_0(r)$ for all $i$. Note that any of the
previously defined pdf’s on $\Re$ yield a corresponding multidimensional pdf by
this construction. In a similar manner we can construct pmf’s on discrete
product spaces as a product of marginal pmf’s.
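
The product construction is straightforward to express in code. The following sketch assumes two illustrative marginals, a uniform pdf on $[0, 1]$ and an exponential pdf with rate 2, and checks by a coarse midpoint sum that the resulting two-dimensional pdf integrates to approximately 1. The names `product_pdf`, `uniform01`, and `exponential` are hypothetical.

```python
import math

def product_pdf(marginals):
    """Return the k-dimensional product pdf f(x) = prod_i f_i(x_i)."""
    def f(x):
        value = 1.0
        for f_i, x_i in zip(marginals, x):
            value *= f_i(x_i)
        return value
    return f

def uniform01(r):
    """Uniform pdf on [0, 1]."""
    return 1.0 if 0.0 <= r <= 1.0 else 0.0

def exponential(r, lam=2.0):
    """Exponential pdf with an assumed rate lam = 2."""
    return lam * math.exp(-lam * r) if r >= 0.0 else 0.0

f = product_pdf([uniform01, exponential])

# Midpoint-rule check over [0, 1] x [0, 12]; the exponential tail
# beyond 12 is negligible for lam = 2.
n = 200
hx, hy = 1.0 / n, 12.0 / n
total = sum(
    f(((i + 0.5) * hx, (j + 0.5) * hy))
    for i in range(n) for j in range(n)
) * hx * hy
```

Because the integrand factors, the double sum is just the product of two one-dimensional sums, which is the numerical counterpart of the "easy to verify" remark above.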

[2.17] The multidimensional Gaussian pdf.

Let $\mathbf{m} = (m_0, m_1, \ldots, m_{k-1})^t$ denote a column vector (the
superscript $t$ stands for “transpose”). Let $\Lambda$ denote a $k$ by $k$ square matrix
with entries $\{\lambda_{i,j};\ i = 0, 1, \ldots, k-1;\ j = 0, 1, \ldots, k-1\}$. Assume that
$\Lambda$ is symmetric; that is, that $\Lambda^t = \Lambda$ or, equivalently, that $\lambda_{i,j} = \lambda_{j,i}$,
all $i, j$. Assume also that $\Lambda$ is positive definite; that is, for any nonzero
vector $\mathbf{y} \in \Re^k$ the quadratic form $\mathbf{y}^t \Lambda \mathbf{y}$ is positive, that is,

$$\mathbf{y}^t \Lambda \mathbf{y} = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} y_i \lambda_{i,j} y_j > 0.$$

A multidimensional pdf is said to be Gaussian if it has the following
form for some vector $\mathbf{m}$ and matrix $\Lambda$ satisfying the above conditions:

$$f(\mathbf{x}) = (2\pi)^{-k/2} (\det \Lambda)^{-1/2} e^{-\frac{1}{2}(\mathbf{x} - \mathbf{m})^t \Lambda^{-1} (\mathbf{x} - \mathbf{m})}; \quad \mathbf{x} \in \Re^k,$$

where $\det \Lambda$ is the determinant of the matrix $\Lambda$.

Since the matrix $\Lambda$ is positive definite, the inverse of $\Lambda$ exists and hence
the pdf is well defined. It is also necessary for $\Lambda$ to be positive definite
if the integral of the pdf is to be finite. The Gaussian pdf may appear
complicated, but it will later be seen to be one of the simplest to deal with.
We shall later develop the significance of the vector $\mathbf{m}$ and matrix $\Lambda$. Note
that if $\Lambda$ is a diagonal matrix, example [2.17] reduces to a special case of
example [2.16].

The reader must either accept on faith that the multidimensional
Gaussian pdf integrates to 1 or seek out a derivation.

The Gaussian pdf can be extended to complex vectors if the constraints
on $\Lambda$ are modified to require that $\Lambda^* = \Lambda$, where the asterisk denotes
conjugate transpose, and where for any vector $\mathbf{y}$ not identically 0 it is required
that $\mathbf{y}^* \Lambda \mathbf{y} > 0$.
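
To make the remark about diagonal $\Lambda$ concrete, here is a small sketch for the special case $k = 2$ only, written without any matrix library. The function names, the numerical tolerance in the symmetry check, and the example values are assumptions for illustration. For a 2 by 2 symmetric matrix, positive definiteness is equivalent to a positive upper-left entry and a positive determinant, which is the test used below.

```python
import math

def gaussian_pdf_2d(x, m, Lam):
    """Two-dimensional Gaussian pdf for mean m and 2x2 matrix Lam."""
    (a, b), (c, d) = Lam
    det = a * d - b * c
    if abs(b - c) > 1e-12 or a <= 0 or det <= 0:
        raise ValueError("Lam must be symmetric and positive definite")
    # Explicit inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    u0, u1 = x[0] - m[0], x[1] - m[1]
    quad = (u0 * (inv[0][0] * u0 + inv[0][1] * u1)
            + u1 * (inv[1][0] * u0 + inv[1][1] * u1))
    # (2 pi)^(-k/2) with k = 2 is 1 / (2 pi).
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def gaussian_pdf_1d(r, m, var):
    """One-dimensional N(m, var) pdf."""
    return math.exp(-(r - m) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Diagonal Lam: the joint pdf should equal the product of the marginals.
m = (1.0, -1.0)
Lam_diag = ((2.0, 0.0), (0.0, 0.5))
x = (0.3, 0.2)
joint = gaussian_pdf_2d(x, m, Lam_diag)
product = gaussian_pdf_1d(x[0], m[0], 2.0) * gaussian_pdf_1d(x[1], m[1], 0.5)
```

With off-diagonal entries the quadratic form couples the coordinates and the pdf no longer factors into one-dimensional Gaussian pdf's.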



[2.18] Mixtures.

Suppose that $P_i$, $i = 1, 2, \ldots$ is a collection of probability
measures on a common measurable space $(\Omega, \mathcal{F})$, and let $a_i$, $i = 1, 2, \ldots$
be nonnegative numbers that sum to 1. Then the set function
determined by

$$P(F) = \sum_{i=1}^{\infty} a_i P_i(F)$$

is also a probability measure on $(\Omega, \mathcal{F})$. This relation is usually
abbreviated to

$$P = \sum_{i=1}^{\infty} a_i P_i.$$

The first two axioms are obviously satisfied by $P$, and countable
additivity follows from the properties of sums. (Finite additivity is easily
demonstrated for the case of a finite number of nonzero $a_i$.) A probability
measure formed in this way is called a mixture. Observe that this
construction can be used to form a probability measure with both discrete and
continuous aspects. For example, let $\Omega$ be the real line and $\mathcal{F}$ the Borel
field; suppose that $f$ is a pdf and $p$ is a pmf; then for any $\lambda \in (0, 1)$ the
measure $P$ defined by

$$P(F) = \lambda \sum_{x \in F} p(x) + (1 - \lambda) \int_{x \in F} f(x)\,dx$$

combines a discrete portion described by $p$ and a continuous portion
described by $f$. Expectations can be computed in a similar way. Given a
function $g$,

$$E(g) = \lambda \sum_{x \in \Omega} g(x) p(x) + (1 - \lambda) \int_{x \in \Omega} g(x) f(x)\,dx.$$

Note that this construction works for both scalar and vector spaces. 
This combination of discrete and continuous attributes is one of the main 
applications of mixtures. Another is in modeling a random process where 
there is some uncertainty about the parameters of the experiment. For 
example, consider a probability space for the following experiment: First 
a fair coin is flipped and a 0 or 1 (tail or head) observed. If the coin toss 
results in a 1, then a fair die described by a uniform pmf $p_1$ is rolled, and
the outcome is the result of the experiment. If the coin toss results in a
0, then a biased die described by a nonuniform pmf $p_2$ is rolled, and the
outcome is the result of the experiment. The pmf of the overall experiment
is then the mixture $p_1/2 + p_2/2$. The mixture model captures our ignorance
of which die we will be rolling. 
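
The coin-and-die experiment can be written out with exact rational arithmetic. In the sketch below the fair-die pmf `p1` comes from the text, while the particular biased pmf `p2` (face 1 with probability 1/2, the others 1/10 each) and the helper name `mixture` are assumptions made up for the example.

```python
from fractions import Fraction

def mixture(weights, pmfs):
    """Return the mixture pmf sum_i a_i p_i as a dict outcome -> probability."""
    if any(a < 0 for a in weights) or sum(weights) != 1:
        raise ValueError("weights must be nonnegative and sum to 1")
    outcomes = sorted(set().union(*pmfs))  # union of the dict keys
    return {w: sum(a * p.get(w, 0) for a, p in zip(weights, pmfs))
            for w in outcomes}

faces = range(1, 7)
p1 = {w: Fraction(1, 6) for w in faces}                      # fair die
p2 = {w: Fraction(1, 2) if w == 1 else Fraction(1, 10)       # assumed bias
      for w in faces}

# Fair coin: each die is selected with probability 1/2.
p = mixture([Fraction(1, 2), Fraction(1, 2)], [p1, p2])
```

Here `p[1]` works out to 1/3 and every other face to 2/15, and the values still sum to 1, as the mixture construction guarantees.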



2.6 Independence 

Given a probability space $(\Omega, \mathcal{F}, P)$, two events $F$ and $G$ are defined to
be independent if $P(F \cap G) = P(F)P(G)$. A collection of events
$\{F_i;\ i = 0, 1, \ldots, k-1\}$ is said to be independent or mutually independent if for any
distinct subcollection $\{F_{l_i};\ i = 0, 1, \ldots, m-1\}$, $l_i < k$, we have that

$$P\left(\bigcap_{i=0}^{m-1} F_{l_i}\right) = \prod_{i=0}^{m-1} P(F_{l_i}).$$

In words: the probability of the intersection of any subcollection of the given
events equals the product of the probabilities of the separate events.
Unfortunately it is not enough to simply require that
$P\left(\bigcap_{i=0}^{k-1} F_i\right) = \prod_{i=0}^{k-1} P(F_i)$,
as this does not imply a similar result for all possible subcollections of
events, which is what will be needed. For example, consider the following
case where $P(F \cap G \cap H) = P(F)P(G)P(H)$ for three events $F$, $G$, and $H$,
yet it is not true that $P(F \cap G) = P(F)P(G)$:

$$P(F) = P(G) = P(H) = \frac{1}{3}$$

$$P(F \cap G \cap H) = \frac{1}{27} = P(F)P(G)P(H)$$

$$P(F \cap G) = P(G \cap H) = P(F \cap H) = \frac{1}{27} \ne P(F)P(G).$$

The example places zero probability on the overlap $F \cap G$ except where it
also overlaps $H$, i.e., $P(F \cap G \cap H^c) = 0$. Thus in this case $P(F \cap G \cap H) =
P(F)P(G)P(H) = 1/27$, but $P(F \cap G) = 1/27 \ne P(F)P(G) = 1/9$.
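
One concrete probability space realizing this example can be built from five atoms. The atom labels and probabilities below are an assumed construction chosen to match the stated values; the helper names are hypothetical. The check over all subcollections of size two or more is exactly the definition of mutual independence.

```python
from fractions import Fraction
from itertools import combinations

# Atoms: "fgh" lies in all three events, "f"/"g"/"h" in exactly one,
# and "none" in no event. The probabilities sum to 1.
pmf = {
    "fgh": Fraction(1, 27),
    "f": Fraction(8, 27),
    "g": Fraction(8, 27),
    "h": Fraction(8, 27),
    "none": Fraction(2, 27),
}
F, G, H = {"fgh", "f"}, {"fgh", "g"}, {"fgh", "h"}

def prob(event):
    """P(event) as the sum of the pmf over the atoms in the event."""
    return sum((pmf[w] for w in event), Fraction(0))

def product_rule_holds(events):
    """True if P(intersection of events) equals the product of their probabilities."""
    product = Fraction(1)
    for e in events:
        product *= prob(e)
    return prob(set.intersection(*events)) == product

def mutually_independent(events):
    """Check the product rule for every subcollection of size >= 2."""
    return all(product_rule_holds(sub)
               for m in range(2, len(events) + 1)
               for sub in combinations(events, m))
```

With these definitions the triple product rule holds for `(F, G, H)` while the pairwise rule fails, so `mutually_independent((F, G, H))` is false.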

The concept of independence in the probabilistic sense we have defined 
relates easily to the intuitive idea of independence of physical events. For 
example, if a fair die is rolled twice, one would expect the second roll 
to be unrelated to the first roll because there is no physical connection 
between the individual outcomes. Independence in the probabilistic sense 
is reflected in this experiment. The probability of any given outcome for 
either of the individual rolls is 1/6. The probability of any given pair of
outcomes is $(1/6)^2 = 1/36$; the addition of a second outcome diminishes
the overall probability by exactly the probability of the individual event,
viz., 1/6. Note that the probabilities are not added: the probability of
two successive outcomes cannot reasonably be greater than the probability 
of either of the outcomes alone. Do not, however, confuse the concept of 
independence with the concept of disjoint or mutually exclusive events. If 
you roll the die once, the event the roll is a one is not independent of 
the event the roll is a six. Given one event, the other cannot happen — 
they are neither physically nor probabilistically independent. These are 
mutually exclusive events. 



2.7 Elementary Conditional Probability 

Intuitively, independence of two events means that the occurrence of one 
event should not affect the occurrence of the other. For example, the knowl- 
edge of the outcome of the first roll of a die should not change the probabil- 
ities for the outcome of the second roll of the die if the die has no memory. 
To be more precise, the notion of conditional probability is required.
Consider the following motivation. Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space
and that an observer is told that an event $G$ has already occurred. The
observer thus has a posteriori knowledge of the experiment. The observer
is then asked to calculate the probability of another event $F$ given this
information. We will denote this probability of $F$ given $G$ by $P(F | G)$. Thus
instead of the a priori or unconditional probability $P(F)$, the observer
must compute the a posteriori or conditional probability $P(F | G)$, read
as “the probability that event $F$ occurs given that the event $G$ occurred.”
For a fixed $G$ the observer should be able to find $P(F | G)$ for all events
$F$; thus the observer is in fact being asked to describe a new probability
measure, say $P_G$, on $(\Omega, \mathcal{F})$. How should this be defined? Intuition will
lead to a useful definition and this definition will indeed provide a useful
interpretation of independence.
First, since the observer has been told that $G$ has occurred and hence
$\omega \in G$, clearly the new probability measure $P_G$ must assign zero probability
to the set of all $\omega$ outside of $G$, that is, we should have

$$P(G^c | G) = 0 \tag{2.91}$$

or, equivalently,

$$P(G | G) = 1. \tag{2.92}$$

Eq. (2.91) plus the axioms of probability in turn imply that

$$P(F | G) = P(F \cap (G \cup G^c) | G) = P(F \cap G | G). \tag{2.93}$$



Second, there is no reason to suspect that the relative probabilities within
$G$ should change because of the conditioning. For example, if an event
$F \subset G$ is twice as probable as an event $H \subset G$ with respect to $P$, then the
same should be true with respect to $P_G$. For arbitrary events $F$ and $H$,
the events $F \cap G$ and $H \cap G$ are both in $G$, and hence this preservation of
relative probability implies that

$$\frac{P(F \cap G | G)}{P(H \cap G | G)} = \frac{P(F \cap G)}{P(H \cap G)}.$$

But if we take $H = \Omega$ in this formula and use (2.92)-(2.93), we have that

$$P(F | G) = P(F \cap G | G) = \frac{P(F \cap G)}{P(G)}, \tag{2.94}$$

which is in fact the formula we now use to define the conditional probability 
of the event F given the event G. The conditional probability can be 
interpreted as “cutting down” the original probability space to a probability 
space with the smaller sample space G and with probabilities equal to the 
renormalized probabilities of the intersection of events with the given event 
G on the original space. 
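
As a small illustration of the definition, the following sketch computes elementary conditional probabilities on a finite sample space. The fair-die pmf and the helper names `prob` and `cond_prob` are assumptions chosen for the example; they do not appear in the text.

```python
from fractions import Fraction


def prob(event, pmf):
    """P(event) on a finite sample space described by a pmf (dict: outcome -> mass)."""
    return sum((pmf[w] for w in event), Fraction(0))


def cond_prob(f, g, pmf):
    """Elementary conditional probability P(F|G) = P(F ∩ G) / P(G)."""
    p_g = prob(g, pmf)
    if p_g == 0:
        # The elementary definition only makes sense when P(G) > 0.
        raise ValueError("conditioning event must have nonzero probability")
    return prob(f & g, pmf) / p_g


# Hypothetical example: one roll of a fair die.
die = {w: Fraction(1, 6) for w in range(1, 7)}
even = {2, 4, 6}
big = {4, 5, 6}

p_even_given_big = cond_prob(even, big, die)          # P({4, 6}) / P({4, 5, 6})
p_big_given_big = cond_prob(big, big, die)            # the property P(G|G) = 1
p_outside_given_big = cond_prob(set(die) - big, big, die)  # the property P(G^c|G) = 0
```

Exact rational arithmetic is used so that the two requirements that motivated the definition, zero mass outside $G$ and unit mass on $G$, can be checked without rounding.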

This definition meets the intuitive requirements of the derivation, but
does it make sense and does it fulfill the original goal of providing an
interpretation for independence? It does make sense provided $P(G) > 0$, that
is, the conditioning event does not have zero probability. This is in fact the
distinguishing requirement that makes the above definition work for what is
known as elementary conditional probability. Non-elementary conditional
probability will provide a more general definition that will work for
conditioning events having zero probability, such as the event that a fair spin of
a pointer results in a reading of exactly $1/\pi$. Further, if $P$ is a probability
measure, then it is easy to see that $P_G$ defined by $P_G(F) = P(F|G)$ for
$F \in \mathcal{F}$ is also a probability measure on the same space (remember $G$ stays
fixed), i.e., $P_G$ is a normalized and countably additive function of events.
As to independence, suppose that $F$ and $G$ are independent events and
that $P(G) > 0$. Then

$$P(F|G) = \frac{P(F \cap G)}{P(G)} = \frac{P(F)P(G)}{P(G)} = P(F),$$

so the probability of $F$ is not affected by the knowledge that $G$ has occurred.
This is exactly what one would expect from the intuitive notion of the
independence of two events. Note, however, that it would not be as useful
to define independence of two events by requiring $P(F) = P(F|G)$ since it
would be less general than the product definition; it requires that one of
the events have a nonzero probability.
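
The difference between the product definition of independence and the conditional form can be checked on a finite space. The sketch below again assumes a fair die; the events `F` and `G` and the helper names are illustrative choices only.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair die.
die = {w: Fraction(1, 6) for w in range(1, 7)}


def prob(event):
    """P(event) under the fair-die pmf."""
    return sum((die[w] for w in event), Fraction(0))


def independent(f, g):
    """Product definition: P(F ∩ G) = P(F) P(G)."""
    return prob(f & g) == prob(f) * prob(g)


def cond_prob(f, g):
    """P(F|G), or None when P(G) = 0 and the elementary definition does not apply."""
    if prob(g) == 0:
        return None
    return prob(f & g) / prob(g)


F = {2, 4, 6}     # roll is even
G = {1, 2, 3, 4}  # roll is at most four

product_holds = independent(F, G)          # 1/3 == (1/2)(2/3)
conditional = cond_prob(F, G)              # equals P(F)
empty_is_independent = independent(F, set())
empty_conditional = cond_prob(F, set())    # undefined in the elementary sense
```

The empty event satisfies the product definition with every event, yet $P(F|G)$ is not defined for it, which is why the product form is the more general definition.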

Conditional probability provides a means of constructing new probability
spaces from old ones by using conditional pmf’s and elementary
conditional pdf’s.

[2.18] Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space described by a pmf $p$
and that $A$ is an event with nonzero probability. Then the pmf $p_A$
defined by

$$p_A(\omega) = \begin{cases} \dfrac{p(\omega)}{P(A)} = P(\{\omega\}|A), & \omega \in A \\ 0, & \omega \notin A \end{cases}$$

is a pmf and implies a probability space $(\Omega, \mathcal{F}, P_A)$, where

$$P_A(F) = \sum_{\omega \in F} p_A(\omega) \tag{2.95}$$

$$= P(F|A). \tag{2.96}$$

$p_A$ is called a conditional pmf. More specifically, it is the conditional
pmf given the event $A$. In some cases it may be more convenient
to define the conditional pmf on the sample space $A$ and hence the
conditional probability measure on the original event space.

As an example, suppose that $p$ is a geometric pmf and that $A = \{\omega :
\omega \ge K\} = \{K, K+1, \ldots\}$. In this case the conditional pmf given
that the outcome is greater than or equal to $K$ is

$$p_A(k) = \frac{(1-p)^{k-1}p}{\sum_{l=K}^{\infty}(1-p)^{l-1}p} = \frac{(1-p)^{k-1}p}{(1-p)^{K-1}} \tag{2.97}$$

$$= (1-p)^{k-K}p; \quad k = K, K+1, K+2, \ldots, \tag{2.98}$$

which can be recognized as a geometric pmf which begins at $k = K$.
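
The conditional geometric pmf can be checked numerically. The sketch below assumes the convention $p(k) = (1-p)^{k-1}p$ for $k = 1, 2, \ldots$ and the conditioning event $A = \{K, K+1, \ldots\}$; the particular values of `p` and `K` are arbitrary illustrative choices.

```python
from fractions import Fraction


def geometric_pmf(k, p):
    """Geometric pmf (assumed convention): p(k) = (1-p)^(k-1) p for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p if k >= 1 else Fraction(0)


def conditional_pmf(k, K, p):
    """p_A(k) = p(k) / P(A) for k in A = {K, K+1, ...}, and 0 otherwise."""
    # P(A) is one minus the mass on {1, ..., K-1}; this equals (1-p)^(K-1).
    p_A = 1 - sum(geometric_pmf(l, p) for l in range(1, K))
    return geometric_pmf(k, p) / p_A if k >= K else Fraction(0)


p, K = Fraction(1, 3), 4
shifted = [conditional_pmf(k, K, p) for k in range(K, K + 5)]   # p_A(K), p_A(K+1), ...
original = [geometric_pmf(j, p) for j in range(1, 6)]           # p(1), p(2), ...
```

Under these assumptions the two lists coincide term by term: the conditional pmf is the original geometric pmf shifted so that its first point of support is $K$.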



[2.19] Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space described by a pdf $f$
and that $A$ is an event with nonzero probability. Then the $f_A$ defined
by

$$f_A(\omega) = \begin{cases} \dfrac{f(\omega)}{P(A)}, & \omega \in A \\ 0, & \omega \notin A \end{cases}$$

is a pdf on $A$ and describes a probability measure

$$P_A(F) = \int_{\omega \in F} f_A(\omega)\, d\omega \tag{2.99}$$

$$= P(F|A). \tag{2.100}$$

$f_A$ is called an elementary conditional pdf (given the event $A$). The
word “elementary” reflects the fact that the conditioning event has
nonzero probability. We will later see how conditional probability can
be usefully extended to conditioning on events of zero probability.

As a simple example, consider the continuous analog of the previous
conditional geometric pmf example. Given an exponential pdf and $A =
\{r : r \ge c\}$, define

$$f_A(x) = \frac{\lambda e^{-\lambda x}}{\int_c^{\infty} \lambda e^{-\lambda y}\, dy} = \frac{\lambda e^{-\lambda x}}{e^{-\lambda c}} \tag{2.101}$$

$$= \lambda e^{-\lambda (x - c)}; \quad x \ge c, \tag{2.102}$$

which can be recognized as an exponential pdf that starts at $c$. The
exponential pdf and geometric pmf share this unusual property: conditioning
on the output being larger than some number does not change the basic
form of the pdf or pmf, only its starting point. This has the discouraging
implication that if, for example, the time for the next arrival of a bus is
described by an exponential pdf, then knowing you have already waited for
an hour does not change your pdf to the next arrival from what it was when
you arrived.
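
The bus-stop remark can be made concrete with a short calculation. The sketch below assumes an exponential pdf $\lambda e^{-\lambda x}$ on $x \ge 0$; the function names and the numerical values of the rate, the elapsed wait, and the additional wait are illustrative assumptions.

```python
import math


def exp_tail(t, lam):
    """P(X > t) for the exponential pdf lam * exp(-lam * x), x >= 0."""
    return math.exp(-lam * t) if t >= 0 else 1.0


def cond_tail(t, c, lam):
    """P(X > c + t | X > c), from the elementary definition P(F ∩ A) / P(A)."""
    return exp_tail(c + t, lam) / exp_tail(c, lam)


lam = 2.0        # assumed rate parameter
waited = 1.0     # time already spent waiting (the constant c)
extra = 0.5      # additional waiting time of interest

fresh = exp_tail(extra, lam)                  # probability for a new arrival
after_wait = cond_tail(extra, waited, lam)    # probability after having waited
```

Up to floating-point rounding the two values agree, which is the statement that having already waited does not change the distribution of the remaining wait.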



2.8 Problems 

1. Suppose that you have a set function $P$ defined for all subsets $F \subset \Omega$
of a sample space $\Omega$ and suppose that you know that this set function
satisfies (2.7)-(2.9). Show that for arbitrary (not necessarily disjoint)
events,

$$P(F \cup G) = P(F) + P(G) - P(F \cap G).$$

2. Describe the sigma-field of subsets of $\Re$ generated by the points or
singleton sets. Does this sigma-field contain intervals of the form
$(a, b)$ for $b > a$?

3. Given a finite subset $A$ of the real line $\Re$, prove that the power set of
$A$ and $\mathcal{B}(A)$ are the same. Repeat for a countably infinite subset of
$\Re$.

4. Given that the discrete sample space $\Omega$ has $n$ elements, show that the
power set of $\Omega$ consists of $2^n$ elements.

5. *Let $\Omega = \Re$, the real line, and consider the collection $\mathcal{F}$ of subsets of
$\Re$ defined as all sets of the form

k m 

z— 0 j—0 

for all possible choices of nonnegative integers $k$ and $m$ and all possible
choices of real numbers $a_i < b_i$, $c_i < d_i$. If $k$ or $m$ is 0, then the
respective unions are defined to be empty so that the empty set itself
has the form given. In other words, $\mathcal{F}$ contains all possible finite
unions of half-open intervals of this form and complements of such
half-open intervals. Every set of this form is in $\mathcal{F}$ and every set in
$\mathcal{F}$ has this form. Prove that $\mathcal{F}$ is a field of subsets of $\Omega$. Does $\mathcal{F}$
contain the points? For example, is the singleton set $\{0\}$ in $\mathcal{F}$? Is $\mathcal{F}$
a sigma-field?

6. Let $\Omega = [0, \infty)$ be a sample space and let $\mathcal{F}$ be the sigma-field of
subsets of $\Omega$ generated by all sets of the form $(n, n+1)$ for $n = 1, 2, \ldots$







(a) Are the following subsets of $\Omega$ in $\mathcal{F}$? (i) $[0, \infty)$, (ii) $Z_+ = \{0, 1, 2, \ldots\}$,
(iii) $[0, k] \cup [k+1, \infty)$ for any positive integer $k$, (iv) $\{k\}$ for
any positive integer $k$, (v) $[0, k]$ for any positive integer $k$, (vi)
$(1/3, 2)$.

(b) Define the following set function on subsets of $\Omega$:

$$P(F) = c \sum_{i \in Z_+ :\, i + 1/2 \in F} 3^{-i}$$

(If there is no $i$ for which $i + 1/2 \in F$, then the sum is taken as
zero.) Is $P$ a probability measure on $(\Omega, \mathcal{F})$ for an appropriate
choice of $c$? If so, what is $c$?

(c) Repeat part (b) with $\mathcal{B}$, the Borel field, replacing $\mathcal{F}$ as the event
space.

(d) Repeat part (b) with the power set of $[0, \infty)$ replacing $\mathcal{F}$ as the
event space.

(e) Find $P(F)$ for the sets $F$ considered in part (a).

7. Show that an equivalent axiom to 2.3 of probability is the following:

If $F$ and $G$ are disjoint, then $P(F \cup G) = P(F) + P(G)$,

that is, we really need only specify finite additivity for the special
case of $n = 2$.

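8. Consider the measurable space $([0, 1], \mathcal{B}([0, 1]))$. Define a set function
$P$ on this space as follows:

$$P(F) = \begin{cases} 1/2 & \text{if } 0 \in F \text{ or } 1 \in F \text{ but not both} \\ 1 & \text{if } 0 \in F \text{ and } 1 \in F \\ 0 & \text{otherwise.} \end{cases}$$

Is $P$ a probability measure?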

9. Let $S$ be a sphere in $\Re^3$: $S = \{(x, y, z) : x^2 + y^2 + z^2 \le r^2\}$,
where $r$ is a fixed radius. In the sphere are fixed $N$ molecules of gas,
each molecule being considered as an infinitesimal volume (that is,
it occupies only a point in space). Define for any subset $F$ of $S$ the
function

$$\mu(F) = \{\text{the number of molecules in } F\}.$$

Show that $P(F) = \mu(F)/N$ is a probability measure on the measurable
space consisting of $S$ and its power set.

10. *Suppose that you are given a probability space $(\Omega, \mathcal{F}, P)$ and that a
collection $\mathcal{F}_P$ of subsets of $\Omega$ is defined by

$$\mathcal{F}_P = \{F \cup N;\ \text{all } F \in \mathcal{F},\ \text{all } N \subset G \text{ for which } G \in \mathcal{F} \text{ and } P(G) = 0\}. \tag{2.103}$$

In words: $\mathcal{F}_P$ contains every event in $\mathcal{F}$ along with every subset $N$
which is a subset of a zero probability event $G \in \mathcal{F}$, whether or not $N$
is itself an event (a member of $\mathcal{F}$). Thus $\mathcal{F}_P$ is formed by adding any
sets not already in $\mathcal{F}$ which happen to be subsets of zero probability
events. We can define a set function $\bar{P}$ for the measurable space
$(\Omega, \mathcal{F}_P)$ by

$$\bar{P}(F \cup N) = P(F) \text{ if } F \in \mathcal{F} \text{ and } N \subset G \in \mathcal{F}, \text{ where } P(G) = 0. \tag{2.104}$$

Show that $(\Omega, \mathcal{F}_P, \bar{P})$ is a probability space, i.e., you must show that
$\mathcal{F}_P$ is an event space and that $\bar{P}$ is a probability measure. A probability
space with the property that all subsets of zero probability
events are also events is said to be complete and the probability space
$(\Omega, \mathcal{F}_P, \bar{P})$ is called the completion of the probability space $(\Omega, \mathcal{F}, P)$.

In problems 2.11 to 2.17 let $(\Omega, \mathcal{F}, P)$ be a probability space and
assume that all given sets are events.

11. If $G \subset F$, prove that $P(F - G) = P(F) - P(G)$. Use this fact to prove
that if $G \subset F$, then $P(G) \le P(F)$.

12. Let $\{F_i\}$ be a countable partition of a set $G$. Prove that for any event
$H$,

$$\sum_i P(H \cap F_i) = P(H \cap G).$$

13. If $\{F_i;\ i = 1, 2, \ldots\}$ forms a partition of $\Omega$ and $\{G_i;\ i = 1, 2, \ldots\}$
forms a partition of $\Omega$, prove that for any $H$,

$$P(H) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} P(H \cap F_i \cap G_j).$$

14. Prove that $|P(F) - P(G)| \le P(F \Delta G)$.

15. Prove that $P(F \cup G) \le P(F) + P(G)$. Prove more generally that for
any sequence (i.e., countable collection) of events $F_i$,

$$P\left(\bigcup_{i=1}^{\infty} F_i\right) \le \sum_{i=1}^{\infty} P(F_i).$$

This inequality is called the union bound or the Bonferroni inequality.
(Hint: Use problem A.2 or 2.1.)

16. Prove that for any events $F$, $G$, and $H$,

$$P(F \Delta G) \le P(F \Delta H) + P(H \Delta G).$$

In words: If the probability of the symmetric difference of two events
is small, then the two events must have approximately the same
probability. The astute observer may recognize this as a form of the
triangle inequality; one can consider $P(F \Delta G)$ as a distance or metric
on events.

17. Prove that if $P(F) > 1 - \delta$ and $P(G) > 1 - \delta$, then also $P(F \cap G) >
1 - 2\delta$. In other words, if two events have probability nearly one, then
their intersection has probability nearly one.

18. *The Cantor set. Consider the probability space $(\Omega, \mathcal{B}(\Omega), P)$ where
$P$ is described by a uniform pdf on $\Omega = [0, 1)$. Let $F_1 = (1/3, 2/3)$,
the middle third of the sample space. Form the set $G_1 = \Omega - F_1$
by removing the middle third of the unit interval. Next define $F_2$
as the union of the middle thirds of all of the intervals in $G_1$, i.e., $F_2 =
(1/9, 2/9) \cup (7/9, 8/9)$. Define $G_2$ as what remains when we remove $F_2$
from $G_1$, that is,

$$G_2 = G_1 - F_2 = [0, 1) - (F_1 \cup F_2).$$

Continue in this manner. At stage $n$, $F_n$ is the union of the middle
thirds of all of the intervals in $G_{n-1} = [0, 1) - \bigcup_{k=1}^{n-1} F_k$. The Cantor
set is defined as the limit of the $G_n$, that is,

$$C = \bigcap_{n=1}^{\infty} G_n = [0, 1) - \bigcup_{n=1}^{\infty} F_n. \tag{2.105}$$

(a) Prove that $C \in \mathcal{B}(\Omega)$, i.e., that it is an event.

(b) Prove that

$$P(F_n) = \frac{1}{3}\left(\frac{2}{3}\right)^{n-1}; \quad n = 1, 2, \ldots. \tag{2.106}$$

(c) Prove that $P(C) = 0$, i.e., that the Cantor set has zero
probability.

One thing that makes this problem interesting is that unlike most
simple examples of nonempty events with zero probability, the Cantor
set has an uncountable infinity of points and is not a discrete set. This
can be shown by first showing that a point $x \in C$ if and only if the
point can be expressed as a ternary number $x = \sum_{n=1}^{\infty} a_n 3^{-n}$ where
all the $a_n$ are either 0 or 2. Thus the number of points in the Cantor
set is the same as the number of real numbers that can be expressed
in this fashion, which is the same as the number of real numbers that
can be expressed in a binary expansion (since each $a_n$ can have only
two values), which is the same as the number of points in the unit
interval, which is uncountably infinite.
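
The middle-thirds construction is easy to carry out numerically for the first few stages. The sketch below is only a sanity check of the interval lengths, not a proof; it ignores whether endpoints are open or closed (single points have zero probability under a uniform pdf), and the helper names are assumptions.

```python
from fractions import Fraction


def remove_middle_thirds(intervals):
    """Split each (a, b) into its two outer thirds (kept) and its middle third (removed)."""
    kept, removed = [], []
    for a, b in intervals:
        third = (b - a) / 3
        kept += [(a, a + third), (b - third, b)]
        removed.append((a + third, b - third))
    return kept, removed


def total_length(intervals):
    """Probability of a union of disjoint intervals under the uniform pdf on the unit interval."""
    return sum((b - a for a, b in intervals), Fraction(0))


stages = 10
g = [(Fraction(0), Fraction(1))]   # stage 0: the whole unit interval
removed_lengths = []               # lengths of F_1, F_2, ...
for n in range(1, stages + 1):
    g, f = remove_middle_thirds(g)
    removed_lengths.append(total_length(f))

remaining = total_length(g)        # length of G_n after the last stage
```

The removed lengths can be compared with the closed form in part (b), and the remaining length shrinks geometrically, which is consistent with the claim in part (c).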

19. Six people sit at a circular table and pass around and roll a single fair
die (equally probable to have any face 1 through 6 showing) beginning
with person #1. The game continues until the first 6 is rolled; the
person who rolled it wins the game. What is the probability that
player #2 wins?

20. Show that given (2.22) through (2.24), (2.28) or (2.29) implies (2.25).
Thus (2.25), (2.28), and (2.29) provide equivalent candidates for the
fourth axiom of probability.

21. Suppose that $P$ is a probability measure on the real line and define the
sets $F_n = (0, 1/n)$ for all positive integers $n$. Evaluate $\lim_{n \to \infty} P(F_n)$.

22. Answer true or false for each of the following statements. Answers 
must be justified. 

(a) The following is a valid probability measure on the sample space
$\Omega = \{1, 2, 3, 4, 5, 6\}$ with event space $\mathcal{F}$ = all subsets of $\Omega$:

= all PGP. 

ieF 

(b) The following is a valid probability measure on the sample space
$\Omega = \{1, 2, 3, 4, 5, 6\}$ with event space $\mathcal{F}$ = all subsets of $\Omega$:

= if2GPor6GP 
[ 0 otherwise 

(c) If $P(G \cup F) = P(F) + P(G)$, then $F$ and $G$ are independent.

(d) $P(F|G) > P(G)$ for all events $F$ and $G$.

(e) Mutually exclusive (disjoint) events with nonzero probability
cannot be independent.

(f) For any finite collection of events $F_i$, $i = 1, 2, \ldots, N$

N 

23. Prove or provide a counterexample for the relation $P(F|G) + P(F|G^c) =
P(F)$.

24. Find the mean, second moment, and variance of a uniform pdf on an 
interval [a,b). 

25. Given a sample space $\Omega = \{0, 1, 2, \ldots\}$ define

$k = 0, 1, 2, \ldots$

(a) What must $\gamma$ be in order for $p(k)$ to be a pmf?

(b) Find the probabilities $P(\{0, 2, 4, 6, \ldots\})$, $P(\{1, 3, 5, 7, \ldots\})$, and
$P(\{0, 1, 2, 3, 4, \ldots, 20\})$.

(c) Suppose that $K$ is a fixed integer. Find $P(\{0, K, 2K, 3K, \ldots\})$.

(d) Find the mean, second moment, and variance of this pmf. 

26. Suppose that $p(k)$ is a geometric pmf. Define $q(k) = (p(k) + p(-k))/2$.
Show that this is a pmf and find its mean and variance. Find the
probability of the sets $\{k : |k| \ge K\}$ and $\{k : k \text{ is a multiple of } 3\}$.
Find the probability of the set $\{k : k \text{ is odd}\}$.

27. Define a pmf p{k) = /\k\\ for k € Z. Evaluate the constant C 

and find the mean and variance of this pmf. 

28. A probability space consists of a sample space $\Omega$ = all pairs of positive
integers (that is, $\Omega = \{1, 2, 3, \ldots\}^2$) and a probability measure $P$
described by the pmf $p$ defined by

$$p(k, m) = p^2 (1 - p)^{k + m - 2}.$$

(a) Find $P(\{(k, m) : k > m\})$.

(b) Find the probability $P(\{(k, m) : k + m = r\})$ as a function of $r$
for $r = 2, 3, \ldots$ Show that the result is a pmf.

(c) Find the probability $P(\{(k, m) : k \text{ is an odd number}\})$.

(d) Define the event $F = \{(k, m) : k > m\}$. Find the conditional
pmf $p_F(k, m) = P(\{(k, m)\}|F)$. Is this a product pmf?







29. Define the uniform probability density function on $[0, 1)$ in the usual
way as

$$f(r) = \begin{cases} 1 & 0 \le r < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Define the set $F = \{0.25, 0.75\}$, a set with only two points.
What is the value of

$$\int_F f(r)\, dr?$$

The Riemann integral is well defined for a finite collection of
points and this should be easy. What is $\int_{F^c} f(r)\, dr$?

(b) Now define the set $F$ as the collection of all rational numbers
in $[0, 1)$, that is, all numbers that can be expressed as $k/n$ for
some integers $0 \le k < n$. What is the integral $\int_F f(r)\, dr$? Is
it defined? Thinking intuitively, what should it be? Suppose
instead you consider the set $F^c$, the set of all irrational numbers
in $[0, 1)$. What is $\int_{F^c} f(r)\, dr$?

30. Given the uniform pdf on $[0, 1]$, $f(x) = 1$; $x \in [0, 1]$, find an expression
for $P((a, b))$ for all real $b > a$. Define the cumulative distribution
function or cdf $F$ as the probability of the event $\{x : x \le r\}$ as a
function of $r \in \Re$:

$$F(r) = P((-\infty, r]) = \int_{-\infty}^{r} f(x)\, dx. \tag{2.107}$$

Find the cdf for the uniform pdf. Find the probability of the event



G-[u:-.u:G ^ 



for some even k 



_ I I r 1 1 1 

k even ^ 

31. *Let $\Omega$ be a unit square $\{(x, y) : (x, y) \in \Re^2,\ -1/2 \le x \le 1/2,\
-1/2 \le y \le 1/2\}$ and let $\mathcal{F}$ be the corresponding product Borel field.
Is the circle $\{(x, y) : (x^2 + y^2)^{1/2} \le 1/2\}$ in $\mathcal{F}$? (Give a plausibility
argument.) If so, find the probability of this event if one assumes a
uniform density function on the unit square.

32. Given a pdf $f$, find the cumulative distribution function or cdf $F$
defined as in (2.107) for the exponential, Laplacian, and Gaussian
pdf’s. In the Gaussian case, express the cdf in terms of the function.
Prove that if $a \ge b$, then $F(a) \ge F(b)$. What is ?

33. Let $\Omega = \Re^2$ and suppose we have a pdf $f(x, y)$ such that

$$f(x, y) = \begin{cases} C & \text{if } x \ge 0,\ y \ge 0,\ x + y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find the probability $P(\{(x, y) : 2x > y\})$. Find the probability
$P(\{(x, y) : x \le a\})$ for all real $a$. Is $f$ a product pdf?

34. Prove that the product $k$-dimensional pdf integrates to 1 over $\Re^k$.

35. Given the one-dimensional exponential pdf, find $P(\{x : x > r\})$ and
the cumulative distribution function $P(\{x : x \le r\})$ for $r \in \Re$.

36. Given the $k$-dimensional product doubly exponential pdf, find the
probabilities of the following events in $\Re^k$: $\{\mathbf{x} : x_0 > 0\}$, $\{\mathbf{x} : x_i >
0, \text{ all } i = 0, 1, \ldots, k - 1\}$, $\{\mathbf{x} : x_0 > x_1\}$.

37. Let $(\Omega, \mathcal{F}) = (\Re, \mathcal{B}(\Re))$. Let $P_1$ be the probability measure on this
space induced by a geometric pmf with parameter $p$ and let $P_2$ be
the probability measure induced on this space by an exponential pdf
with parameter $\lambda$. Form the mixture measure $P = P_1/2 + P_2/2$. Find
$P(\{\omega : \omega > r\})$ for all $r \in [0, \infty)$.

38. Let n = 3?^ and suppose we have a pdf f{x, y) such that 

f{x,y) = ; x € (—00,00) , y € [ 0 ,oo) . 

Find the constant C. Is f a product pdf? Find the probability 
Pr({(x, y) : \/\x\ < a}) for all possible values of a parameter a. Find 
the probability Pr({(a;,j/) : x^ < y}). 

39. Define g{x) by 

f '1 _ / X £ [ 0 , 00) 

9y^) “1^ Q otherwise . 



Let $\Omega = \Re^2$ and suppose we have a pdf $f(x, y)$ such that

$$f(x, y) = C g(x) g(y - x).$$

Find the constant $C$. Find an expression for the probability $P(\{(x, y) :
y < a\})$ as a function of the parameter $a$. Is $f$ a product pdf?

40. Let $\Omega = \Re^2$ and suppose we have a pdf such that

$$f(x, y) = \begin{cases} C|x| & -1 \le x \le 1;\ -1 \le y \le x \\ 0 & \text{otherwise.} \end{cases}$$

Find the constant $C$. Is $f$ a product pdf?

41. Suppose that a probability space has as sample space $\Re^n$, $n$-dimensional
Euclidean space. (This is a product space.) Suppose that a
multidimensional pdf $f$ is defined on this space by

$$f(\mathbf{x}) = \begin{cases} C; & \max_i |x_i| \le 1/2 \\ 0; & \text{otherwise;} \end{cases}$$

that is, $f(\mathbf{x}) = C$ when $-1/2 \le x_i \le 1/2$ for $i = 0, 1, \ldots, n - 1$ and
is 0 otherwise.

(a) What is $C$?

(b) Is $f$ a product pdf?

(c) What is $P(\{\mathbf{x} : \min_i x_i \ge 0\})$, that is, the probability that the
smallest coordinate value is nonnegative?

Suppose next that we have a pdf $g$ defined by

$$g(\mathbf{x}) = \begin{cases} K; & \|\mathbf{x}\| \le 1 \\ 0; & \text{otherwise,} \end{cases}$$

where

$$\|\mathbf{x}\| = \sqrt{\sum_{i=0}^{n-1} x_i^2}$$

is the Euclidean norm of the vector $\mathbf{x}$. Thus $g$ is $K$ inside an
$n$-dimensional sphere of radius 1 centered at the origin.

(d) What is the constant $K$? (You may need to go to a book of
integral tables to find this.)

(e) Is this density a product pdf?




42. Let $(\Omega, \mathcal{F}, P)$ be a probability space and consider events $F$, $G$, and
$H$ for which $P(F) > P(G) > P(H) > 0$. Events $F$ and $G$ form a
partition of $\Omega$, and events $F$ and $H$ are independent. Can events $G$
and $H$ be disjoint?

43. Given a probability space $(\Omega, \mathcal{F}, P)$, let $F$, $G$, and $H$ be events
such that $P(F \cap G|H) = 1$. Which of the following statements are
true? Why or why not?

(a) $P(F \cap G) = 1$

(b) $P(F \cap G \cap H) = P(H)$

(c) $P(F^c|H) = 0$

(d) $H = \Omega$

44. (Courtesy of Prof. T. Cover) Suppose that the evidence of an event $F$
increases the likelihood of a criminal’s guilt; that is, if $G$ is the event
that the criminal is guilty, then $P(G|F) > P(G)$. The prosecutor
discovers that the event $F$ did not occur. What do you now know
about the criminal’s guilt? Prove your answer.

45. Suppose that $X$ is a binary random variable with outputs $\{a, b\}$ with a
pmf $p_X(a) = p$ and $p_X(b) = 1 - p$ and $Y$ is a random variable described
by the conditional pdf $f_{Y|X}(y|x) = \exp(-(y - x)^2 / 2\sigma^2) / \sqrt{2\pi\sigma^2}$.
Describe the MAP detector for $X$ given $Y$ and find an expression for
the probability of error in terms of the $Q$ function.

Suppose that $p = 0.5$, but you are free to choose $a$ and $b$ subject only
to the constraint that $(a^2 + b^2)/2 = E_b$. Which is a better choice,
$a = -b$ or $a$ nonzero with $b = 0$? What can you say about the
minimum achievable $P_e$?




Chapter 3 



Random Variables, 
Vectors, and Processes 

3.1 Introduction 

This chapter provides the theoretical foundations and many examples of 
random variables, vectors, and processes. All three concepts are variations 
on a single theme and may be included in the general term of random object. 
We will deal specifically with random variables first because they are the 
simplest conceptually — they can be considered to be special cases of the 
other two concepts. 

3.1.1 Random Variables 

The name random variable suggests a variable that takes on values ran- 
domly. In a loose, intuitive way this is the right interpretation — e.g., an 
observer who is measuring the amount of noise on a communication link 
sees a random variable in this sense. We require, however, a more precise 
mathematical definition for analytical purposes. Mathematically a random 
variable is neither random nor a variable — it is just a function mapping 
one sample space into another space. The first space is the sample space 
portion of a probability space, and the second space is a subset of the real 
line (some authors would call this a “real-valued” random variable). The
careful mathematical definition will place a constraint on the function to 
ensure that the theory makes sense, but for the moment we will adopt the 
informal definition that a random variable is just a function. 

A random variable is perhaps best thought of as a measurement on a
probability space; that is, for each sample point $\omega$ the random variable
produces some value, denoted functionally as $f(\omega)$. One can view $\omega$ as
the result of some experiment and $f(\omega)$ as the result of a measurement
made on the experiment, as in the example of the simple binary quantizer
introduced in the introduction to chapter 2. The experiment outcome $\omega$
is from an abstract space, e.g., real numbers, integers, ASCII characters,
waveforms, sequences, Chinese characters, etc. The resulting value of the
measurement or random variable $f(\omega)$, however, must be “concrete” in the
sense of being a real number, e.g., a meter reading. The randomness is all
in the original probability space and not in the random variable; that is,
once the $\omega$ is selected in a “random” way, the output value or sample value
of the random variable is determined.

Alternatively, the original point $\omega$ can be viewed as an “input signal”
and the random variable $f$ can be viewed as “signal processing,” i.e., the
input signal $\omega$ is converted into an “output signal” $f(\omega)$ by the random
variable. This viewpoint becomes both precise and relevant when we indeed
choose our original sample space to be a signal space and we generalize
random variables by random vectors and processes.

Before proceeding to the formal definition of random variables, vectors, 
and processes, we motivate several of the basic ideas by simple examples, 
beginning with random variables constructed on the fair wheel experiment 
of the introduction to chapter 2. 



A Coin Flip 

We have already encountered an example of a random variable in the in- 
troduction to chapter 2, where we defined a random variable q on the 
spinning wheel experiment which produced an output with the same pmf 
as a uniform coin flip. We begin by summarizing the idea with some slight 
notational changes and then consider the implications in additional detail. 

Begin with a probability space $(\Omega, \mathcal{F}, P)$ where $\Omega = \Re$ and the
probability $P$ is defined by (2.2) using the uniform pdf on $[0, 1)$ of (2.4). Define
the function $Y : \Re \to \{0, 1\}$ by

$$Y(r) = \begin{cases} 0 & \text{if } r \le 0.5 \\ 1 & \text{otherwise.} \end{cases} \tag{3.1}$$

When Tyche performs the experiment of spinning the pointer, we do not
actually observe the pointer, but only the resulting binary value of $Y$. $Y$
can be thought of as signal processing or as a measurement on the original
experiment. Subject to a technical constraint to be introduced later, any
function defined on the sample space of an experiment is called a random
variable. The “randomness” of a random variable is “inherited” from the
underlying experiment and in theory the probability measure describing
its outputs should be derivable from the initial probability space and the
structure of the function. To avoid confusion with the probability measure
$P$ of the original experiment, refer to the probability measure associated
with outcomes of $Y$ as $P_Y$. $P_Y$ is called the distribution of the random
variable $Y$. The probability $P_Y(F)$ can be defined in a natural way as the
probability computed using $P$ of all the original samples that are mapped
by $Y$ into the subset $F$:

$$P_Y(F) = P(\{r : Y(r) \in F\}). \tag{3.2}$$
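
The idea that the distribution of $Y$ is inherited from the underlying experiment can be illustrated by simulation. The sketch below is a Monte Carlo approximation, not an exact computation: it assumes Python's `random.Random.random()` as a stand-in for the uniform pointer on $[0, 1)$, takes $Y(r) = 0$ for $r \le 0.5$, and uses an arbitrary seed and sample size.

```python
import random


def Y(r):
    """Binary quantizer: 0 if r <= 0.5, and 1 otherwise."""
    return 0 if r <= 0.5 else 1


def estimate_PY(F, n=100_000, seed=0):
    """Estimate P_Y(F) = P({r : Y(r) in F}) as a relative frequency.

    Each draw of rng.random() plays the role of one spin of the fair pointer;
    only the value Y(r) is observed and compared against the output event F.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if Y(rng.random()) in F)
    return hits / n


p_zero = estimate_PY({0})      # should be close to 0.5
p_one = estimate_PY({1})       # should be close to 0.5
p_all = estimate_PY({0, 1})    # the whole output space
p_none = estimate_PY(set())    # the empty event
```

The estimates for the whole output space and the empty event are exact by construction, while the estimates for the singletons only approximate the value 0.5 that the formula gives.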



In this simple discrete example $P_Y$ is naturally defined for any subset $F$ of
$\Omega_Y = \{0, 1\}$, but in preparation for more complicated examples we assume
that $P_Y$ is to be defined for all suitably defined events, that is, for $F \in \mathcal{B}_Y$,
where $\mathcal{B}_Y$ is an event space consisting of subsets of $\Omega_Y$. The probability
measure for the output sample space can be computed from the probability
measure for the input using the formula (3.2), which will shortly be
generalized. This idea of deriving new probabilistic descriptions for the outputs
of some operation on an experiment producing inputs to the operation is
fundamental to the theories of probability, random processes, and signal
processing.

For example, in our simple example (3.2) implies that

$$\begin{aligned}
P_Y(\{0\}) &= P(\{r : Y(r) = 0\}) \\
&= P(\{r : 0 \le r \le 0.5\}) \\
&= P([0, 0.5]) \\
&= 0.5 \\
P_Y(\{1\}) &= P((0.5, 1.0]) \\
&= 0.5 \\
P_Y(\Omega_Y) &= P_Y(\{0, 1\}) \\
&= P(\Re) = 1 \\
P_Y(\emptyset) &= P(\emptyset) = 0,
\end{aligned}$$

so that every output event can be assigned a probability by $P_Y$ by
computing the probability of the corresponding input event under the input
probability measure $P$.

Eq. (3.2) can be written in a convenient compact manner by means of
the definition of the inverse image of a set $F$ under a mapping $Y : \Omega \to \Omega_Y$:

$$Y^{-1}(F) = \{r : Y(r) \in F\}. \tag{3.3}$$

With this notation (3.2) becomes

$$P_Y(F) = P(Y^{-1}(F)); \quad F \subset \Omega_Y; \tag{3.4}$$

that is, the inverse image of a given set (output) under a mapping is the
collection of all points in the original space (input points) which map into
the given (output) set. This result is sometimes called the fundamental
derived distribution formula or the inverse image formula. It will be seen in a
variety of forms throughout the book. When dealing with random variables
it is common to interpret the probability $P_Y(F)$ as “the probability that
the random variable $Y$ takes on a value in $F$” or “the probability that the
event $Y \in F$ occurs.” These English statements are often abbreviated to
the form $\Pr(Y \in F)$.

The probability measure $P_Y$ can be computed by summing a pmf, which
we denote $p_Y$. In particular, if we define

$$p_Y(y) = P_Y(\{y\}); \quad y \in \Omega_Y, \tag{3.5}$$

then additivity implies that

$$P_Y(F) = \sum_{y \in F} p_Y(y); \quad F \in \mathcal{B}_Y. \tag{3.6}$$

Thus the pmf describing a random variable can be computed as a special
case of the inverse image formula (3.5), and then used to compute the
probability of any event.
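
When the underlying sample space is itself discrete, the inverse image formula can be evaluated exactly by accumulating the mass of every sample point that maps to a given output value. The sketch below assumes a finite sample space given as a dictionary; the fair-die pmf, the parity measurement, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction


def derived_pmf(p, f):
    """Return p_Y with p_Y(y) = P(f^{-1}({y})), for a pmf p on a finite sample space."""
    p_y = defaultdict(Fraction)      # missing keys start at Fraction(0)
    for omega, mass in p.items():
        p_y[f(omega)] += mass        # omega belongs to the inverse image of {f(omega)}
    return dict(p_y)


def derived_prob(p_y, F):
    """P_Y(F) obtained by summing the derived pmf over the points of F."""
    return sum((p_y.get(y, Fraction(0)) for y in F), Fraction(0))


# Hypothetical experiment: a fair die, measured only through its parity.
die = {w: Fraction(1, 6) for w in range(1, 7)}
parity_pmf = derived_pmf(die, lambda w: w % 2)
p_odd = derived_prob(parity_pmf, {1})
p_any = derived_prob(parity_pmf, {0, 1})
```

Accumulating mass by output value is the same as computing $P$ of each inverse image, because the inverse images of distinct output points are disjoint.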

The indirect method provides a description of the fair coin flip in terms
of a random variable. The idea of a random variable can also be applied to
the direct description of a probability space. Again as in the introduction
to chapter 2, directly describe a single coin flip by choosing $\Omega = \{0, 1\}$ and
assign a probability measure $P$ on this space as in (2.12). Now define a
random variable $V : \{0, 1\} \to \{0, 1\}$ on this space by

$$V(r) = r. \tag{3.7}$$

Here $V$ is trivial; it is just the identity mapping. The measurement just puts
out the outcome of the original experiment and the inverse image formula
trivially yields

$$P_V(F) = P(F)$$

$$p_V(v) = p(v).$$

Note that this construction works on any probability space having the real
line or a Borel subset thereof as a sample space. Thus for each of the named
pmf’s and pdf’s there is a random variable associated with that pmf or pdf.

If we have two random variables $V$ and $Y$ (which may be defined on
completely separate experiments as in the present case), we say that they
are equivalent or identically distributed if $P_V(F) = P_Y(F)$ for all events $F$,
that is, the two probability measures agree exactly on all events. It is easy
to show with the inverse image formula that $V$ is equivalent to $Y$ and hence
that

$$p_V(y) = p_Y(y) = 0.5; \quad y = 0, 1. \tag{3.8}$$

Thus we have two equivalent random variables, either of which can be used
to model the single coin flip. Note that we do not say the random variables
are equal since they need not be. For example, you could spin a pointer
and find $Y$ and I could flip my own coin to find $V$. The probabilities are
the same, but the outcomes might or might not differ.

3.1.2 Random Vectors 

The issue of the possible equality of two random variables raises an
interesting point. If you are told that $Y$ and $V$ are two separate random
variables with pmf’s $p_Y$ and $p_V$, then the question of whether or not they
are equivalent can be answered from these pmf’s alone. If you wish to
determine whether or not the two random variables are in fact equal,
however, then they must be considered together or jointly. In the case where
we have a random variable $Y$ with outcomes in $\{0, 1\}$ and a random
variable $V$ with outcomes in $\{0, 1\}$, we could consider the two together as a
single random vector $(Y, V)$ with outcomes in the Cartesian product space
$\Omega_{YV} = \{0, 1\}^2 = \{(0, 0), (0, 1), (1, 0), (1, 1)\}$ with some pmf $p_{Y,V}$ describing
the combined behavior

$$p_{Y,V}(y, v) = \Pr(Y = y, V = v) \tag{3.9}$$

so that

$$\Pr((Y, V) \in F) = \sum_{y, v : (y, v) \in F} p_{Y,V}(y, v); \quad F \in \mathcal{B}_{YV},$$

where in this simple discrete problem we take the event space $\mathcal{B}_{YV}$ to be
the power set of $\Omega_{YV}$. Now the question of equality makes sense as we can
evaluate the probability that the two are equal:

$$\Pr(Y = V) = \sum_{y, v : y = v} p_{Y,V}(y, v).$$

If this probability is 1, then we know that the two random variables are in
fact equal with probability 1.

A two-dimensional random vector $(Y, V)$ is simply two random
variables described on a common probability space. Knowledge of the
individual pmf’s $p_Y$ and $p_V$ alone is not sufficient in general to determine $p_{Y,V}$;
more information is needed. Either the joint pmf must be given to us or we
must be told the definitions of the two random variables (two components
of the two-dimensional binary vector) so that the joint pmf can be derived.
For example, if we are told that the two random variables $Y$ and $V$ of our
example are in fact equal, then $\Pr(Y = V) = 1$ and $p_{Y,V}(y, v) = 0.5$ for
$y = v$, and 0 for $y \ne v$. This experiment can be thought of as flipping two
coins that are soldered together on the edge so that the result is two heads
or two tails.
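
The point that the individual pmf’s do not determine the joint pmf can be seen by comparing two joint pmf’s with identical marginals. In the sketch below, `soldered` is the joint pmf of the soldered coins, while `separate` is an assumed alternative in which every pair of outcomes has probability 1/4; the names are illustrative only.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Soldered coins: the two components always agree.
soldered = {(0, 0): half, (0, 1): Fraction(0), (1, 0): Fraction(0), (1, 1): half}
# Assumed alternative: all four pairs are equally probable.
separate = {(y, v): quarter for y in (0, 1) for v in (0, 1)}


def marginals(joint):
    """Sum the joint pmf over the other coordinate to obtain (p_Y, p_V)."""
    p_y = {y: sum(p for (a, _), p in joint.items() if a == y) for y in (0, 1)}
    p_v = {v: sum(p for (_, b), p in joint.items() if b == v) for v in (0, 1)}
    return p_y, p_v


def prob_equal(joint):
    """Pr(Y = V): sum the joint pmf over the pairs with y == v."""
    return sum(p for (y, v), p in joint.items() if y == v)


same_marginals = marginals(soldered) == marginals(separate)
pr_equal_soldered = prob_equal(soldered)
pr_equal_separate = prob_equal(separate)
```

Both models give each component the fair-coin pmf, yet the probability that the two components agree differs, so the marginals alone cannot settle the question of equality.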

To see an example of radically different behavior, consider the random
variable $W : [0, 1) \to \{0, 1\}$ defined by

$$W(r) = \begin{cases} 0 & r \in [0.0, 0.25) \cup [0.5, 0.75) \\ 1 & \text{otherwise.} \end{cases}$$

It is easy to see that $W$ is equivalent to the random variables $Y$ and $V$ of
this section, but $W$ and $Y$ are not equal even though they are equivalent
and defined on a common experiment. We can easily derive the joint pmf for
$W$ and $Y$ since the inverse image formula extends immediately to random
vectors. Now the events involve the outputs of two random variables so
some care is needed to keep the notation from getting out of hand. As in
the random variable case, any probability measure on a discrete space can
be expressed as a sum over a pmf on points, that is,

$$P_{Y,W}(F) = \sum_{y, w : (y, w) \in F} p_{Y,W}(y, w), \tag{3.11}$$

where $F \subset \{0, 1\}^2$, and where

$$p_{Y,W}(y, w) = P_{Y,W}(\{(y, w)\}) = \Pr(Y = y, W = w); \quad y \in \{0, 1\},\ w \in \{0, 1\}. \tag{3.12}$$

As previously observed, pmf’s describing the joint behavior of several ran- 
dom variables are called joint pmf’s and the corresponding distribution is 
called a joint distribution. Thus finding the entire distribution requires
only finding the pmf, which can be done via the inverse image formula.
For example, if (y,w) = (0,0), then 

$$\begin{aligned}
p_{Y,W}(0,0) &= P(\{r : Y(r) = 0, W(r) = 0\}) \\
&= P([0, 0.5) \cap ([0.0, 0.25) \cup [0.5, 0.75))) \\
&= P([0, 0.25)) \\
&= 0.25.
\end{aligned}$$

Similarly it can be shown that

$$p_{Y,W}(0,1) = p_{Y,W}(1,0) = p_{Y,W}(1,1) = 0.25.$$
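
The inverse image computation above can be checked mechanically. The following Python sketch is ours and not part of the original development; it assumes, consistent with the calculation of the (0,0) entry, that Y is 0 on [0, 0.5) and 1 on [0.5, 1). Because Y and W are constant on each quarter-interval of [0, 1), the probability of each pair of outputs under the uniform pdf is just the total length of the quarter-intervals mapped to that pair.

```python
from fractions import Fraction
from itertools import product

# The quarter-intervals [k/4, (k+1)/4) partition [0, 1). Y and W are constant
# on each of them, so exact probabilities are sums of interval lengths.
QUARTERS = [(Fraction(k, 4), Fraction(k + 1, 4)) for k in range(4)]

def Y(r):
    """Assumed definition: Y(r) = 0 on [0, 0.5) and 1 on [0.5, 1)."""
    return 0 if r < Fraction(1, 2) else 1

def W(r):
    """W(r) = 0 on [0, 0.25) and on [0.5, 0.75), and 1 otherwise."""
    in_zero_set = r < Fraction(1, 4) or Fraction(1, 2) <= r < Fraction(3, 4)
    return 0 if in_zero_set else 1

def joint_pmf(f, g):
    """p(a, b) = P({r : f(r) = a, g(r) = b}) under the uniform pdf on [0, 1)."""
    pmf = {pair: Fraction(0) for pair in product((0, 1), repeat=2)}
    for left, right in QUARTERS:
        # Evaluate at the left endpoint; the functions are constant on the interval.
        pmf[(f(left), g(left))] += right - left
    return pmf

p_YW = joint_pmf(Y, W)
for pair, prob in sorted(p_YW.items()):
    print(pair, prob)
```

Exact rational arithmetic is used so that the four probabilities can be compared with 1/4 without rounding concerns.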

Joint and marginal pmf’s can both be computed from the underlying 
distribution, but the marginals can also be found directly from the joints 
without reference to the underlying distribution. For example, $p_Y(y_0)$ can
be expressed as $P_{Y,W}(F)$ by choosing $F = \{(y,w) : y = y_0\}$. Then use the
pmf formula for $P_{Y,W}$ to write

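$$\begin{aligned}
p_Y(y_0) &= P_{Y,W}(F) \\
&= \sum_{y,w:\,(y,w) \in F} p_{Y,W}(y,w) \\
&= \sum_{w \in \Omega_W} p_{Y,W}(y_0,w).
\end{aligned} \tag{3.13}$$

Similarly

$$p_W(w_0) = \sum_{y \in \Omega_Y} p_{Y,W}(y,w_0). \tag{3.14}$$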

This is an example of the consistency of probability: using different pmf’s
derived from a common experiment to compute the probability of a single
event must produce the same result, so the marginals must agree with the
joints. Consistency means that we can find marginals by “summing out”
joints without knowing the underlying experiment on which the random
variables are defined.
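
As an illustration of “summing out,” the following Python sketch (ours, with the hypothetical helper name `marginals`) computes both marginal pmf’s from a joint pmf stored as a dictionary, following (3.13) and (3.14). No reference to the underlying experiment is needed.

```python
from collections import defaultdict
from fractions import Fraction

def marginals(joint):
    """Sum a joint pmf {(y, w): prob} over one coordinate at a time."""
    p_first = defaultdict(Fraction)   # Fraction() is 0, so sums start at zero
    p_second = defaultdict(Fraction)
    for (y, w), prob in joint.items():
        p_first[y] += prob    # as in (3.13): sum over w for each fixed y
        p_second[w] += prob   # as in (3.14): sum over y for each fixed w
    return dict(p_first), dict(p_second)

quarter = Fraction(1, 4)
p_YW = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
p_Y, p_W = marginals(p_YW)
print(p_Y, p_W)
```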

This completes the derived distribution of the two random variables Y 
and W (or the single random vector (Y, W)) defined on the original uniform 
pdf experiment. For this particular example the joint pmf and the marginal 
pmf’s have the interesting property 

$$p_{Y,W}(y,w) = p_Y(y)\,p_W(w), \tag{3.15}$$

that is, the joint distribution is a product distribution. A product
distribution better models our intuitive feeling of experiments such as
flipping two fair coins and letting the outputs $Y$ and $W$ be 1 or 0 according
to the coins landing heads or tails.

In both of these examples the joint pmf had to be consistent
with the individual pmf’s $p_Y$ and $p_V$ (called marginal pmf’s) in the sense
of giving the same probabilities to events where both joint and marginal
probabilities make sense. In particular,

$$\begin{aligned}
p_Y(y) &= \Pr(Y = y) \\
&= \Pr(Y = y, V \in \{0, 1\}) \\
&= \sum_{v=0}^{1} p_{Y,V}(y,v),
\end{aligned}$$

an example of a consistency property.

The two examples just considered of a random vector $(Y,V)$ with the
property $\Pr(Y = V) = 1$ and the random vector $(Y, W)$ with the property
$p_{Y,W}(y,w) = p_Y(y)p_W(w)$ represent extreme cases of two-dimensional
random vectors. In the first case $Y = V$ and hence being told, say, that
$V = v$ also tells us that necessarily $Y = v$. Thus $V$ depends on $Y$ in a
particularly strong manner and the two random variables can be considered to
be extremely dependent. The product distribution, on the other hand, can be
interpreted as implying that knowing the outcome of one of the random
variables tells us absolutely nothing about the other, as is the case when
flipping two fair coins. Two discrete random variables $Y$ and $W$ will be
defined to be independent if they have a product pmf, that is, if
$p_{Y,W}(y,w) = p_Y(y)p_W(w)$.
Independence of random variables will be shortly related to the idea of in- 
dependence of events introduced in chapter 2, but for the moment simply 
observe that it can be interpreted as meaning that knowing the outcome 
of one random variable does not affect the probability distribution of the 
other. This is a very special case of general joint pmf’s. It may be sur- 
prising that two random variables defined on a common probability space 
can be independent of one another, but this was ensured by the specific 
construction of the two random variables Y and W. 
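
The contrast between the two extreme cases can be made concrete with a short Python sketch (ours; the function name `is_product_pmf` is hypothetical). It computes the marginals of a joint pmf by summing out and then tests whether the joint pmf equals their product at every pair of values.

```python
from fractions import Fraction

def is_product_pmf(joint):
    """True if joint(y, w) == p_first(y) * p_second(w) for all pairs (y, w)."""
    ys = {y for y, _ in joint}
    ws = {w for _, w in joint}
    p_first = {y: sum(joint.get((y, w), 0) for w in ws) for y in ys}
    p_second = {w: sum(joint.get((y, w), 0) for y in ys) for w in ws}
    return all(joint.get((y, w), 0) == p_first[y] * p_second[w]
               for y in ys for w in ws)

half, quarter = Fraction(1, 2), Fraction(1, 4)

# "Soldered coins": Y = V with probability 1.
p_YV = {(0, 0): half, (1, 1): half, (0, 1): Fraction(0), (1, 0): Fraction(0)}

# Two separate fair coins: all four pairs equally probable.
p_YW = {(y, w): quarter for y in (0, 1) for w in (0, 1)}

print(is_product_pmf(p_YV), is_product_pmf(p_YW))
```

Both joint pmf’s have the same equiprobable marginals, yet only the second is a product pmf, which is why the marginals alone do not determine the joint.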

Note that we have also defined a three dimensional random vector 
(Y, V, W) because we have defined three random variables on a common 
experiment. Hence you should be able to find the joint pmf $p_{Y,V,W}$ using
the same ideas.

Note also that in addition to the indirect derivation of specific examples
of two-dimensional random vectors, a direct development is possible.
For example, let $\{0, 1\}^2$ be a sample space with all of its four points
having equal probability. Any point $r$ in the sample space can be expressed
as $r = (r_0, r_1)$, where $r_i \in \{0, 1\}$ for $i = 0, 1$. Define the random
variables $V : \{0, 1\}^2 \to \{0, 1\}$ and $U : \{0, 1\}^2 \to \{0, 1\}$ by
$V(r_0, r_1) = r_0$ and $U(r_0, r_1) = r_1$. You should convince yourself that

$$p_{Y,W}(y,w) = p_{V,U}(y,w); \quad y = 0, 1,\ w = 0, 1$$

and that $p_Y(y) = p_W(y) = p_V(y) = p_U(y)$; $y = 0, 1$. Thus the random
vectors $(Y, W)$ and $(V, U)$ are equivalent.

In a similar manner pdf’s can be used to describe continuous random
vectors, but we shall postpone this step until a later section and instead 
move to the idea of random processes. 

3.1.3 Random Processes 

It is straightforward conceptually to go from one random variable to $k$
random variables constituting a $k$-dimensional random vector. It is perhaps
a greater leap to extend the idea to a random process. The idea is at
least easy to state, but it will take more work to provide examples and the
mathematical details will prove more complicated. A random process is
a sequence of random variables $\{X_n;\ n = 0, 1, \ldots\}$ defined on a common
experiment. It can be thought of as an infinite dimensional random vector.
To be more accurate, this is an example of a discrete-time, one-sided
random process. It is called “discrete-time” because the index $n$ which
corresponds to time takes on discrete values (here the nonnegative integers)
and it is called “one-sided” because only nonnegative times are allowed. A
discrete-time random process is also called a time series in the statistics
literature and it is often denoted as $\{X(n);\ n = 0, 1, \ldots\}$ and is sometimes
denoted by $\{X[n]\}$ in the digital signal processing literature. Two
questions might occur to the reader: how does one construct an infinite family
of random variables on a single experiment? How can one provide a direct
development of a random process as accomplished for random variables
and vectors? The direct development might appear hopeless since infinite
dimensional vectors are involved.

The first problem is reasonably easy to handle by example. Consider
the usual uniform pdf experiment. Rename the random variables $Y$ and $W$
as $X_0$ and $X_1$, respectively. Consider the following definition of an infinite
family of random variables $X_n : [0, 1) \to \{0, 1\}$ for $n = 0, 1, \ldots$. Every
$r \in [0, 1)$ can be expanded as a binary expansion of the form

$$r = \sum_{n=0}^{\infty} b_n(r)\, 2^{-n-1}. \tag{3.16}$$

This simply replaces the usual decimal representation by a binary
representation. For example, 1/4 is .25 in decimal and .01 or .010000 ... in
binary, 1/2 is .5 in decimal and yields the binary sequence .1000 ..., 3/4 is
.75 in decimal and .11000 ..., and 1/3 is .3333 ... in decimal and
.010101 ... in binary.
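
The binary digits in (3.16) can be generated by repeated doubling. The Python sketch below is ours; the name `binary_digits` is hypothetical, and exact fractions are used so that 1/3 is not rounded. For numbers with two binary expansions (such as 1/2), this procedure returns the terminating one, which matches the examples above.

```python
from fractions import Fraction

def binary_digits(r, n):
    """Return [b_0(r), ..., b_{n-1}(r)], the first n binary digits of r in [0, 1)."""
    if not 0 <= r < 1:
        raise ValueError("r must lie in [0, 1)")
    digits = []
    for _ in range(n):
        r = 2 * r          # shift the expansion one place to the left
        bit = int(r)       # the integer part is the next digit (0 or 1)
        digits.append(bit)
        r -= bit           # keep only the fractional part
    return digits

def partial_sum(digits):
    """Evaluate the sum in (3.16) truncated to the given digits."""
    return sum(Fraction(b, 2 ** (k + 1)) for k, b in enumerate(digits))

for r in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1, 3)):
    print(r, binary_digits(r, 6))
```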

Define the random process by $X_n(r) = b_n(r)$, that is, the $n$th term in
the binary expansion of $r$. When $n = 0, 1$ this reduces to the specific $X_0$
and $X_1$ already considered.

The inverse image formula can be used to compute probabilities, although
the calculus can get messy. Given the simple two-dimensional example,
however, it should be reasonable that the pmf’s for random vectors
of the form $X^n = (X_0, X_1, \ldots, X_{n-1})$ can be evaluated as

$$p_{X^n}(x^n) = \Pr(X^n = x^n) = 2^{-n}; \quad x^n \in \{0, 1\}^n, \tag{3.17}$$

where $\{0, 1\}^n$ is the collection of all $2^n$ binary $n$-tuples. In other words,
the first $n$ binary digits in a binary expansion for a uniformly distributed
random variable are all equally probable. Note that in this special case the
joint pmf’s are again related to the marginal pmf’s in a product fashion,
that is,

$$p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_{X_i}(x_i), \tag{3.18}$$

in which case the random variables $X_0, X_1, \ldots, X_{n-1}$ are said to be
mutually independent or, more simply, independent. If a random process is such
that any finite collection of the random variables produced by the process
are independent and the marginal pmf’s are all the same (as in the case
under consideration), the process is said to be independent identically
distributed or iid for short. An iid process is also called a Bernoulli process,
although the name is sometimes reserved for a binary iid process.
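
Formula (3.17) can be verified exactly for small $n$. The Python sketch below is ours (the function names are hypothetical). It relies on the observation that the first $n$ binary digits are constant on each dyadic interval $[k2^{-n}, (k+1)2^{-n})$, so under the uniform pdf each interval contributes its length $2^{-n}$ to the pmf value of the digit pattern it carries.

```python
from fractions import Fraction
from itertools import product

def first_digits(r, n):
    """First n binary digits (b_0, ..., b_{n-1}) of r in [0, 1), as a tuple."""
    digits = []
    for _ in range(n):
        r *= 2
        bit = int(r)
        digits.append(bit)
        r -= bit
    return tuple(digits)

def pmf_of_first_digits(n):
    """p_{X^n}(x^n) for the binary-digit process under the uniform pdf on [0, 1)."""
    pmf = {x: Fraction(0) for x in product((0, 1), repeat=n)}
    width = Fraction(1, 2 ** n)
    for k in range(2 ** n):
        left = k * width                      # left endpoint of the k-th dyadic interval
        pmf[first_digits(left, n)] += width   # interval length = its probability
    return pmf

p3 = pmf_of_first_digits(3)
print(p3[(0, 1, 1)])
```

Summing `p3` over all triples with a fixed value in one coordinate gives 1/2, so the joint pmf is the product of equiprobable marginals as in (3.18).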

Something fundamentally important has happened here. If we have a 
random process, then the probability distribution for any random vectors 
formed by collecting outputs of the random process can be found (at least 
in theory) from the inverse image formula. The calculus may be a mess, but 
at least in some cases such as this one it is doable. Furthermore these pmf’s 
are consistent in the sense noted before. In particular, if we use (3.13)-(3.14)
to compute the already computed pmf’s for $X_0$ and $X_1$ we get the same
thing we did before: they are each equiprobable binary random variables. If
we compute the joint pmf for $X_0$ and $X_1$ using (3.17) we also get the same
joint pmf we got before. This observation likely seems trivial at this point
(and it should be natural that the math does not give any contradictions), 
but it emphasizes a property that is critically important when trying to 
describe a random process in a more direct fashion. 

Suppose now that a more direct model of a random process is desired 
without a complicated construction on an original experiment. Here the 
problem is not as simple as in the random variable or random vector case 
where all that was needed was a consistent assignment of probabilities and 
an identity mapping. The solution is known as the Kolmogorov exten- 
sion theorem, named after the primary developer of modern probability 
theory. The theorem will be stated formally later in this chapter, but its
complicated proof will be left to other texts. The basic idea, however, can
be stated in a few words. If one can specify a consistent family of pmf’s
$p_{X^n}(x^n)$ for all $n$ (we have done this for $n = 1$ and 2), then there exists
a random process described by those pmf’s. Thus, for example, there will
exist a random process described by the family of pmf’s $p_{X^n}(x^n) = 2^{-n}$ for
$x^n \in \{0, 1\}^n$ for all positive integers $n$ if and only if the family is consistent.
We have already argued that the family is indeed consistent, which means 
that even without the indirect construction previously followed we can ar- 
gue that there is a well-defined random process described by these pmf’s. 
In particular, one can think of a “grand experiment” where Nature selects 
a one-sided binary sequence according to some mysterious probability mea- 
sure on sequences that we have difficulty envisioning. Nature then reveals 
the chosen sequence to us one coordinate at a time, producing the process 
$X_0, X_1, X_2, \ldots$, and the distributions of any finite collection of these
random variables are known from the given pmf’s $p_{X^n}$. Putting this in yet
another way, describing or specifying the finite-dimensional distributions of 
a process is enough to completely describe the process (provided of course 
the given family of distributions is consistent). 

In this example the abstract probability measure on semiinfinite binary 
sequences is not all that mysterious: from our construction the sequence
space can be considered to be essentially the same as the unit interval 
(each point in the unit interval corresponding to a binary sequence) and 
the probability measure is described by a uniform pdf on this interval. 

The second method of describing a random process is by far the most common
in practice. One usually describes a process by its finite sample behavior 
and not by a construction on an abstract experiment. The Kolmogorov 
extension theorem ensures that this works. Consistency is easy to demon- 
strate for iid processes, but unfortunately it becomes more difficult to verify 
in more general cases (and more difficult to define and demonstrate for con- 
tinuous time examples). 
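
For a binary iid family, one form of the consistency requirement is easy to check by machine: summing the pmf of the first $n + 1$ outputs over the last coordinate must reproduce the pmf of the first $n$ outputs. The Python sketch below is ours; the helper names are hypothetical, and it tests only this "drop the last coordinate" condition for a few values of $n$, not every sub-collection of coordinates that full consistency involves.

```python
from fractions import Fraction
from itertools import product

def iid_pmf(n, p_one=Fraction(1, 2)):
    """pmf of n iid binary random variables with Pr(X_i = 1) = p_one."""
    pmf = {}
    for x in product((0, 1), repeat=n):
        prob = Fraction(1)
        for bit in x:
            prob *= p_one if bit else 1 - p_one
        pmf[x] = prob
    return pmf

def sum_out_last(pmf):
    """Marginalize a pmf on n-tuples down to the first n - 1 coordinates."""
    out = {}
    for x, prob in pmf.items():
        out[x[:-1]] = out.get(x[:-1], Fraction(0)) + prob
    return out

def is_consistent(family):
    """family[n] is a pmf on n-tuples; compare each with the next one summed out."""
    return all(sum_out_last(family[n + 1]) == family[n]
               for n in family if n + 1 in family)

fair_family = {n: iid_pmf(n) for n in range(1, 6)}

# A deliberately mismatched family: a biased first output, fair pairs.
bad_family = {1: iid_pmf(1, Fraction(1, 3)), 2: iid_pmf(2)}

print(is_consistent(fair_family), is_consistent(bad_family))
```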

Having toured the basic ideas to be explored in this chapter, we now 
proceed to delve into the details required to make the ideas precise and general.



3.2 Random Variables 

We now develop the promised precise definition of a random variable. As 
you might guess, a technical condition for random variables is required 
because of certain subtle pathological problems that have to do with the 
ability to determine probabilities for the random variable. To arrive at 
the precise definition, we start with the informal definition of a random 
variable that we have already given and then show the inevitable difficulty
that results without the technical condition. We have informally defined a
random variable as being a function on a sample space. Suppose we have a
probability space $(\Omega, \mathcal{F}, P)$. Let $f : \Omega \to \Re$ be a function mapping
the sample space into the real line so that $f$ is a candidate for a random
variable. Since the selection of the original sample point $\omega$ is random,
that is, governed by a probability measure, so should be the output of our
measurement of the random variable $f(\omega)$. That is, we should be able to find
the probability of an “output event” such as the event “the outcome of the
random variable $f$ was between $a$ and $b$,” that is, the event $F \subset \Re$ given
by $F = (a, b)$.
Observe that there are two different kinds of events being considered here: 

1. output events or members of the event space of the range or range 
space of the random variable, that is, events consisting of subsets of 
possible output values of the random variable; and 

2. input events, that is, events in the original sample space of the
original probability space.

Can we find the probability of this output event? That is, can we make
mathematical sense out of the quantity “the probability that $f$ assumes
a value in an event $F \subset \Re$”? On reflection it seems clear that we can.
The probability that $f$ assumes a value in some set of values must be the
probability of all values in the original sample space that result in a value of
$f$ in the given set. We will make this concept more precise shortly. To save
writing we will abbreviate such English statements to the form $\Pr(f \in F)$,
or $\Pr(F)$; that is, when the notation $\Pr(F)$ is encountered it should be
interpreted as shorthand for the English statement “the probability of
an event $F$” or “the probability that the event $F$ will occur” and not as a
precise mathematical quantity.

Recall from chapter 2 that for a subset $F$ of the real line $\Re$ to be an
event, it must be in a sigma field or event space of subsets of $\Re$. Recall also
that we adopted the Borel field $\mathcal{B}(\Re)$ as our basic event space for the real
line. Hence it makes sense to require that our output event $F$ be a Borel
set.

Thus we can now state the question as follows: Given a probability
space $(\Omega, \mathcal{F}, P)$ and a function $f : \Omega \to \Re$, is there a reasonable and
useful precise definition for the probability $\Pr(f \in F)$ for any
$F \in \mathcal{B}(\Re)$, the Borel field or event space of the real line? Since the
probability measure $P$ sits on the original measurable space $(\Omega, \mathcal{F})$ and
since $f$ assumes a value in $F$ if and only if $\omega \in \Omega$ is chosen so that
$f(\omega) \in F$, the desired probability is obviously
$\Pr(f \in F) = P(\{\omega : f(\omega) \in F\}) = P(f^{-1}(F))$. In other
words, the probability that a random variable $f$ takes on a value in a Borel
set $F$ is the probability (defined in the original probability space) of the
set of all (original) sample points $\omega$ that yield a value $f(\omega) \in F$. This, in
turn, is the probability of the inverse image of the Borel set $F$ under the
random variable $f$. This idea of computing the probability of an output
event of a random variable using the original probability measure of the
corresponding inverse image of the output event under the random variable
is depicted in Figure 3.1.




Figure 3.1: The inverse image method: $\Pr(f \in F) = P(\{\omega : f(\omega) \in F\}) = P(f^{-1}(F))$
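
On a finite sample space the inverse image method can be carried out literally. The Python sketch below is ours: the fair-die experiment, the parity function, and the helper names are hypothetical illustrations, with the probability measure $P$ represented by a pmf on the sample points.

```python
from fractions import Fraction

def inverse_image(f, sample_space, F):
    """f^{-1}(F) = {omega : f(omega) in F}."""
    return {omega for omega in sample_space if f(omega) in F}

def prob_output_event(f, pmf, F):
    """Pr(f in F) = P(f^{-1}(F)), with P given by a pmf on a finite sample space."""
    event = inverse_image(f, pmf.keys(), F)
    return sum((pmf[omega] for omega in event), Fraction(0))

# Hypothetical experiment: one roll of a fair die; f reports the parity of the roll.
die = {omega: Fraction(1, 6) for omega in range(1, 7)}
parity = lambda omega: omega % 2

print(inverse_image(parity, die, {1}))      # the odd faces
print(prob_output_event(parity, die, {1}))  # their total probability
```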



This natural definition of the probability of an output event of a random
variable indeed makes sense if and only if the probability $P(f^{-1}(F))$ makes
sense, that is, if the subset $f^{-1}(F)$ of $\Omega$ corresponding to the output event
$F$ is itself an event, in this case an input event or member of the event
space $\mathcal{F}$ of the original sample space. This, then, is the required technical
condition: A function $f$ mapping the sample space of a probability space
$(\Omega, \mathcal{F}, P)$ into the real line $\Re$ is a random variable if and only if the
inverse images of all Borel sets in $\Re$ are members of $\mathcal{F}$, that is, if all of
the subsets of $\Omega$ corresponding to output events (members of $\mathcal{B}(\Re)$) are
input events (members of $\mathcal{F}$). Unlike some of the other pathological
conditions that we have met, it is easy to display some trivial examples where
the technical condition is not met (as we will see in Example [3.11]). We now
formalize the definition:

Given a probability space $(\Omega, \mathcal{F}, P)$, a (real-valued) random variable is
a function $f : \Omega \to \Re$ with the property that if $F \in \mathcal{B}(\Re)$, then also
$f^{-1}(F) = \{\omega : f(\omega) \in F\} \in \mathcal{F}$.

Given a random variable $f$ defined on a probability space $(\Omega, \mathcal{F}, P)$, the
set function

$$\begin{aligned}
P_f(F) &\triangleq P(f^{-1}(F)) \\
&= P(\{\omega : f(\omega) \in F\}) \\
&= \Pr(f \in F); \quad F \in \mathcal{B}(\Re)
\end{aligned} \tag{3.19}$$

is well defined since by definition $f^{-1}(F) \in \mathcal{F}$ for all $F \in \mathcal{B}(\Re)$. The
set function $P_f$ is the distribution of $f$. In the next section the properties
of distributions will be explored.

In some cases one may wish to consider a random variable with a more
limited range space than the real line, e.g., when the random variable is
binary. (Recall from chapter A that the range space of $f$ is the image of
$\Omega$.) If so, $\Re$ can be replaced in the definition by the appropriate subset, say
$A \subset \Re$. This is really just a question of semantics since the two definitions
are equivalent. One or the other view may, however, be simpler to deal
with for a particular problem.

A function meeting the condition in the definition we have given is
said to be measurable. This is because such functions inherit a measure
on their output events (specifically a probability measure in
our context; in other contexts more general measures can be defined on a
measurable space).

If a random variable has a distribution described by a pmf or a pdf with 
a specific name, then the name is often applied also to the random variable; 
e.g., a continuous random variable with a Gaussian pdf is called a Gaussian 
random variable. 

Examples 

In every case we are given a probability space $(\Omega, \mathcal{F}, P)$. For the moment,
however, we will concentrate on the sample space $\Omega$ and the random variable
that is defined functionally on that space. Note that the function must be 
defined for every value in the sample space if it is to be a valid function. 
On the other hand, the function does not have to assume every possible 
value in its range. 

As you will see, there is nothing particularly special about the names
of the random variables. So far we have used the lower case letter $f$.
On occasion we will use other lower case letters such as $g$ and $h$. As we
progress we will follow custom and more often use upper case letters late
in the alphabet, such as $X, Y, Z, U, V$, and $W$. Capital Greek letters like $\Theta$
and $\Psi$ are also popular.

The reader should keep the signal processing interpretation in mind
while considering these examples; several very common types of signal
processing are considered, including quantization, sampling, and filtering.

[3.1] Let $\Omega = \Re$, the real line, and define the random variable
$X : \Omega \to \Omega$ by $X(\omega) = \omega^2$ for all $\omega \in \Omega$. Thus the random variable
is the square of the sample point. Note that since the square of a real number
is always nonnegative, we could replace the range $\Omega$ by the range
space $[0, \infty)$ and consider $X$ as a mapping $X : \Omega \to [0, \infty)$. Other
random variables mapping $\Omega$ into itself are $V(\omega) = |\omega|$,
$Z(\omega) = \sin(\omega)$, $U(\omega) = 3 \times \omega + 321.5$, and so on. We can also consider
the identity mapping as a random variable; that is, we can define a
random variable $W : \Omega \to \Omega$ by $W(\omega) = \omega$.

[3.2] Let $\Omega = \Re$ as in example [3.1] and define the random variable
$f : \Omega \to \{-V, V\}$ by

$$f(r) = \begin{cases} +V & \text{if } r \geq 0 \\ -V & \text{if } r < 0. \end{cases}$$



This example is a variation of the binary quantizer of a real input con- 
sidered in the introduction to chapter 2. With this specific choice of output 
levels it is also called a hard limiter. 
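
Viewed as signal processing, the hard limiter is a one-line function. The Python sketch below is ours; it assumes that the input 0 is assigned to the positive output level, and the parameter name `v` for the output level is hypothetical.

```python
def hard_limiter(r, v=1.0):
    """Binary quantizer: +v for r >= 0 and -v for r < 0 (0 is assumed to map to +v)."""
    return v if r >= 0 else -v

samples = [-2.5, -0.1, 0.0, 0.3, 7.0]
quantized = [hard_limiter(r, v=5.0) for r in samples]
print(quantized)
```

The function is defined for every real input, as required of a valid function on the sample space, even though it assumes only two values.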

So far we have used $\omega$ exclusively to denote the argument of the random
variable. We can, however, use any letter to denote the dummy variable (or
argument or independent variable) of the function, provided that we specify
its domain; that is, we do not need to use $\omega$ all the time to specify elements
of $\Omega$: $r$, $x$, or any other dummy variable will do. We will, however, as a
convention, always use only lower case letters to denote dummy variables.

When referring to a function, we will use several methods of specification.
Sometimes we will only give its name, say $f$; sometimes we will
specify its domain and range, as in $f : \Omega \to A$; sometimes we will provide
a specific dummy variable, as in $f(r)$; and sometimes we will provide the
dummy variable and its domain, as in $f(r);\ r \in \Omega$. Finally, functions can
be shown with a place for the dummy variable marked by a period to avoid
anointing any particular dummy variable as being somehow special, as in
$f(\cdot)$. These various notations are really just different means of denoting the
same thing while emphasizing certain aspects of the functions. The only
real danger of this notation is the same as that of calculus and trigonometry:
if one encounters a function, say $\sin t$, does this mean the sine of a
particular $t$ (and hence a real number) or does it mean the entire waveform
of $\sin t$ for all $t$? The distinction should be clear from the context, but the
ambiguity can be removed, for example, by defining something like $\sin t_0$
to mean a particular value and $\{\sin t;\ t \in \Re\}$ or $\sin(\cdot)$ to mean the entire
waveform.

[3.3] Let $U$ be as in example [3.1] and $f$ as in [3.2]. Then the function
$g : \Omega \to \Omega$ defined by $g(\omega) = f(U(\omega))$ is also a random variable. This
relation is often abbreviated by dropping the explicit dependence on
$\omega$ to write $g = f(U)$. More generally, any function of a function is
another function, called a “composite” function. Thus a function of a
random variable is another random variable. Similarly, one can consider
a random variable formed by a complicated combination of other
random variables, for example, $g(\omega) = \frac{1}{\omega} \sinh^{-1}[\pi \times e^{\cos(|\omega|^{3.4})}]$.

[3.4] Let $\Omega = \Re^k$, $k$-dimensional Euclidean space. Occasionally it is of
interest to focus attention on the random variable which is defined
as a particular coordinate of a vector $\omega = (x_0, x_1, \ldots, x_{k-1}) \in \Re^k$.
Toward this end we can define for each $i = 0, 1, \ldots, k-1$ a sampling
function (or coordinate function) $\Pi_i : \Re^k \to \Re$ as the following
random variable:

$$\Pi_i(\omega) = \Pi_i((x_0, \ldots, x_{k-1})) = x_i.$$

The sampling functions are also called “projections” of the higher
dimensional space onto the lower. (This is the reason for the choice of $\Pi$,
Greek P, not to be confused with the product symbol $\prod$, to denote
the functions.)

Similarly, we can define a sampling function for any product space, e.g., 
for sequence and waveform spaces. 

*[3.5] Given a space $A$, an index set $\mathcal{T}$, and the product space $A^{\mathcal{T}}$, define
as a random variable, for any fixed $t \in \mathcal{T}$, the sampling function
$\Pi_t : A^{\mathcal{T}} \to A$ as follows: since any $\omega \in A^{\mathcal{T}}$ is a vector or function of
the form $\{x_s;\ s \in \mathcal{T}\}$, define for each $t$ in $\mathcal{T}$ the mapping

$$\Pi_t(\omega) = \Pi_t(\{x_s;\ s \in \mathcal{T}\}) = x_t.$$

Thus, for example, if $\Omega$ is a one-sided binary sequence space

$$\prod_{i \in \mathcal{Z}_+} \{0, 1\}_i = \{0, 1\}^{\mathcal{Z}_+},$$

and hence every point has the form $\omega = (x_0, x_1, \ldots)$, then
$\Pi_3((0, 1, 1, 0, 0, 0, 1, 0, 1, \ldots)) = 0$. As another example, if for all $t$
in the index set $\Re_t$ is a replica of $\Re$ and $\Omega$ is the space

$$\Re^{\Re} = \prod_{t \in \Re} \Re_t$$

of all real-valued waveforms $\{x(t);\ t \in (-\infty, \infty)\}$, then for
$\omega = \{\sin t;\ t \in \Re\}$, the value of the sampling function at the particular
time $t = 2\pi$ is

$$\Pi_{2\pi}(\{\sin t;\ t \in \Re\}) = \sin 2\pi = 0.$$
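
A sampling function simply evaluates a point of the product space at one index. In the Python sketch below (ours, with hypothetical names), a sample point is represented as a callable from the index set to the alphabet, so the same code covers both sequences and waveforms.

```python
import math

def sampling_function(t):
    """Return Pi_t, which maps a point omega (a callable s -> x_s) to its coordinate x_t."""
    return lambda omega: omega(t)

# A binary sequence viewed as a function of the index i (only a finite prefix is stored).
bits = (0, 1, 1, 0, 0, 0, 1, 0, 1)
omega_seq = lambda i: bits[i]

# A waveform viewed as a function of time.
omega_wave = math.sin

print(sampling_function(3)(omega_seq))             # coordinate x_3 of the sequence
print(sampling_function(2 * math.pi)(omega_wave))  # sin(2*pi), zero up to rounding
```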







[3.6] Suppose that we have a one-sided binary sequence space $\{0, 1\}^{\mathcal{Z}_+}$.
For any $n \in \{1, 2, \ldots\}$, define the random variable $Y_n$ by
$Y_n(\omega) = Y_n((x_0, x_1, x_2, \ldots))$ = the index (time) of occurrence of the
$n$th 1 in $\omega$. For example, $Y_2((0, 0, 0, 1, 0, 1, 1, 0, 1, \ldots)) = 5$ because the
second sample to be 1 is $x_5$.
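
The random variable of example [3.6] can be sketched in Python as follows. The function name is ours, and returning `None` when the supplied prefix contains fewer than $n$ ones is a choice made for this illustration; it is not part of the definition above.

```python
def index_of_nth_one(omega, n):
    """Y_n(omega): index of the n-th coordinate equal to 1, for n = 1, 2, ..."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    count = 0
    for index, x in enumerate(omega):
        if x == 1:
            count += 1
            if count == n:
                return index
    return None  # fewer than n ones in the coordinates that were supplied

prefix = [0, 0, 0, 1, 0, 1, 1, 0, 1]
print(index_of_nth_one(prefix, 2))
```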

[3.7] Say we have a one-sided sequence space $\Omega = \prod_{i \in \mathcal{Z}_+} \Re_i$, where
$\Re_i$ is a replica of the real line for each $i$ in the index set. Since every
$\omega$ in this space has the form $\{x_0, x_1, \ldots\} = \{x_i;\ i \in \mathcal{Z}_+\}$, we can
define for each positive integer $n$ the random variable $S_n$, depending on $n$,

$$S_n(\omega) = S_n(\{x_i;\ i \in \mathcal{Z}_+\}) = n^{-1} \sum_{i=0}^{n-1} x_i,$$

the arithmetic average or “mean” of the first $n$ coordinates of the
infinite sequence.

For example, if $\omega = \{1, 1, 1, 1, 1, 1, 1, \ldots\}$, then $S_n = 1$. This average
is also called a Cesaro mean or sample average or time average since the
index being summed over often corresponds to time; viz., we are adding the
outputs at times 0 through $n - 1$ in the preceding equation. Such arithmetic
means will later be seen to play a fundamental role in describing the
long-term average behavior of random processes. The arithmetic mean can also
be written using coordinate functions as

$$S_n(\omega) = n^{-1} \sum_{i=0}^{n-1} \Pi_i(\omega), \tag{3.20}$$

which we abbreviate to

$$S_n = n^{-1} \sum_{i=0}^{n-1} \Pi_i \tag{3.21}$$

by suppressing the dummy variable or argument $\omega$. Equation (3.21) is
shorthand for (3.20) and says the same thing: the arithmetic average of
the first $n$ terms of a sequence is the sum of the first $n$ coordinates or
samples of the sequence, divided by $n$.
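
The sample average is easy to compute even when the sample point is an endless sequence, since only the first $n$ coordinates are read. The Python sketch below is ours; the name `sample_average` is hypothetical and the two test sequences are illustrative choices.

```python
from itertools import cycle, islice, repeat

def sample_average(omega, n):
    """S_n(omega) = (1/n) * (x_0 + ... + x_{n-1}) for an iterable omega."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    first_n = list(islice(omega, n))   # read only the first n coordinates
    if len(first_n) < n:
        raise ValueError("omega has fewer than n coordinates")
    return sum(first_n) / n

all_ones = sample_average(repeat(1), 10)         # the constant sequence 1, 1, 1, ...
alternating = sample_average(cycle((0, 1)), 5)   # the sequence 0, 1, 0, 1, 0, ...
print(all_ones, alternating)
```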

[3.8] As a generalization of the sample average consider weighted averages
of sequences. Such weighted averages occur in the convolutions of
linear system theory. Let $\Omega$ be the space $\prod_{i \in \mathcal{Z}} \Re_i$, where $\Re_i$ are
all copies of the real line. Suppose that $\{h_k;\ k = 0, 1, 2, \ldots\}$ is a
fixed sequence of real numbers that can be used to form a weighted
average of the coordinates of $\omega \in \Omega$. Each $\omega$ in this space has the
form $\omega = (\ldots, x_{-1}, x_0, x_1, \ldots) = \{x_i;\ i \in \mathcal{Z}\}$ and hence a weighted
average can be defined for each integer $n$ as the random variable

$$Y_n(\omega) = \sum_{k=0}^{\infty} h_k x_{n-k}.$$



Thus the random variable $Y_n$ is formed as a linear combination of the
coordinates of the sequence constituting the point $\omega$ in the double-sided
sequence space. This is a discrete time convolution of an input sequence
with a linear weighting. In linear system theory the weighting is called a
unit pulse response (or Kronecker delta response or $\delta$ response), and it is
the discrete time equivalent of an impulse response. Note that we could
also use the sampling function notation to write $Y_n$ as a weighted sum of
the sample random variables.
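
The weighted average can be sketched in Python under the simplifying assumption that only finitely many of the weights $h_k$ are nonzero, so that the infinite sum reduces to a finite one. The names below and the two test inputs are ours.

```python
def weighted_average(h, x, n):
    """Y_n = sum over k of h[k] * x(n - k), for a finite list of weights h.

    x is a callable giving the coordinate x_i for any integer i.
    """
    return sum(h_k * x(n - k) for k, h_k in enumerate(h))

h = [0.5, 0.25, 0.25]                            # hypothetical unit pulse response

step = lambda i: 1.0 if i >= 0 else 0.0          # x_i = 1 for i >= 0, else 0
delta = lambda i: 1.0 if i == 0 else 0.0         # Kronecker delta input

print([weighted_average(h, step, n) for n in range(-1, 4)])
print([weighted_average(h, delta, n) for n in range(3)])
```

Feeding in the Kronecker delta returns the weights themselves, which is why the weighting is called the unit pulse response.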

[3.9] In a similar fashion, complicated random variables can be defined on
waveform spaces. For example, let $\Omega = \prod_{t \in \Re} \Re_t$, the space of all
real-valued functions of time such as voltage-time waveforms. For each $T$,
define a time average

$$Y_T(\omega) = Y_T(\{x(t);\ t \in \Re\}) = T^{-1} \int_0^T x(t)\,dt,$$

or given the impulse response $h(t)$ of a causal, linear time-invariant
system, we define a weighted average

$$W_T(\omega) = \int_0^{\infty} h(t)\,x(T - t)\,dt.$$

Are these also random variables? They are certainly functions defined 
on the underlying sample space, but as one might suspect, the sample 
space of all real-valued waveforms is quite large and contains some bizarre 
waveforms. For example, the waveforms can be sufficiently pathological to 
preclude the existence of the integrals cited (see chapter 2 for a discussion 
of this point). These examples are sufficiently complicated to force us now 
to look a bit closer at a proper definition of a random variable and to 
develop a technical condition that constrains the generality of our definition 
but ensures that the definition will lead to a useful theory. It should be 
pointed out, however, that this difficulty is no accident and is not easily 
solved: waveforms are truly more complicated than sequences because of 
the wider range of possible waveforms, and hence continuous time random
processes are more difficult to deal with rigorously than are discrete time
processes. One can write equations such as the integrals and then find 
that the integrals do not make sense even in the general Lebesgue sense. 
Often fairly advanced mathematics are required to properly patch up the 
problems. For purposes of simplicity we usually concentrate on sequences 
(and hence on discrete time) rather than waveforms, and we gloss over the 
technical problems when we consider continuous time examples. 

One must know the event space being considered in order to determine 
whether or not a function is a random variable. While we will virtually 
always assume the usual event spaces (that is, the power set for discrete 
spaces, the Borel field for the real line or subsets of the real line, and the 
corresponding product event spaces for product sample spaces), it is useful 
to consider some other examples to help clarify the basic definition. 

[3.10] First consider $(\Omega, \mathcal{F}, P)$ where $\Omega$ is itself a discrete subset of the
real line $\Re$, e.g., $\{0, 1\}$ or $\mathcal{Z}_+$. If, as usual, we take $\mathcal{F}$ to be the power
set, then any function $f : \Omega \to \Re$ is a random variable. This follows since
the inverse image of any Borel set in $\Re$ must be a subset of $\Omega$ and
hence must be in the collection of all subsets of $\Omega$.

Thus with the usual event space for a discrete sample space — the power 
set — any function defined on the probability space is a random variable. 
This is why all of the structure of event spaces and random variables is 
not seen in elementary texts that consider only discrete spaces: There is no 
need. 

It should be noted that for any $\Omega$, discrete or not, if $\mathcal{F}$ is the power set,
then all functions defined on $\Omega$ are random variables. This fact is useful,
however, only for discrete sample spaces since the power set is not a useful 
event space in the continuous case (since we cannot endow it with useful 
probability measures). 

If, however, $\mathcal{F}$ is not the power set, some functions defined on $\Omega$ are not
random variables, as the following simple example shows:

[3.11] Let $\Omega$ be arbitrary, but let $\mathcal{F}$ be the trivial sigma field $\{\Omega, \emptyset\}$.
On this space it is easy to construct functions that are not random
variables (and hence are non-measurable functions). For example,
let $\Omega = \{0, 1\}$ and define $f(\omega) = \omega$, the identity function. Then
$f^{-1}(\{0\}) = \{0\}$ is not in $\mathcal{F}$, and hence this simple function is not a
random variable. In fact, it is obvious that any function that assigns
different values to 0 and 1 is not a random variable. Note, however,
that some functions, such as constant functions, are random variables.
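
On a finite sample space the technical condition can be tested exhaustively. The Python sketch below is ours and its names are hypothetical. It uses the fact that the inverse image of any Borel set $F$ equals the inverse image of the part of $F$ lying in the (finite) range of the function, so it suffices to check every subset of the range.

```python
from itertools import chain, combinations

def is_random_variable(f, sample_space, event_space):
    """Check that f^{-1}(F) is an event for every subset F of the finite range of f."""
    events = {frozenset(event) for event in event_space}
    values = sorted({f(omega) for omega in sample_space})
    subsets = chain.from_iterable(
        combinations(values, size) for size in range(len(values) + 1))
    return all(
        frozenset(omega for omega in sample_space if f(omega) in F) in events
        for F in subsets)

omega_space = {0, 1}
trivial = [set(), {0, 1}]              # the trivial sigma field
power_set = [set(), {0}, {1}, {0, 1}]  # the usual event space for a discrete space

identity = lambda omega: omega
constant = lambda omega: 7

print(is_random_variable(identity, omega_space, trivial))    # fails: {0} is not an event
print(is_random_variable(constant, omega_space, trivial))    # a constant is measurable
print(is_random_variable(identity, omega_space, power_set))  # the power set admits everything
```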

The problem illustrated by this example is that the input event space 
is not big enough or “fine” enough to contain all input sets corresponding
to output events. This apparently trivial example suggests an important
technique for dealing with advanced random process theory, especially for
continuous time random processes: If the event space is not large enough
to include the inverse image of all Borel sets, then enlarge the event space
to include all such events, viz., by using the power set as in example [3.10].
Alternatively, we might try to force $\mathcal{F}$ to contain all sets of the form
$f^{-1}(F)$, $F \in \mathcal{B}(\Re)$; that is, make $\mathcal{F}$ the sigma field generated by such
sets. Further treatment of this subject is beyond the scope of the book.
However, it is worth remembering that if a sigma field is not big enough 
to make a function a random variable, it can often be enlarged to be big 
enough. This is not idle twiddling; such a procedure is required for impor- 
tant applications, e.g., to make integrals over time defined on a waveform 
space into random variables. 

On a more hopeful tack, if the probability space $(\Omega, \mathcal{F}, P)$ is chosen with
$\Omega = \Re$ and $\mathcal{F} = \mathcal{B}(\Re)$, then all functions $f$ normally encountered in the
real world are in fact random variables. For example, continuous functions,
polynomials, step functions, trigonometric functions, limits of measurable 
functions, maxima and minima of measurable functions, and so on are 
random variables. It is, in fact, extremely difficult to construct functions on 
Borel spaces that are not random variables. The same statement holds for 
functions on sequence spaces. The difficulty is comparable to constructing 
a set on the real line that is not a Borel set and is beyond the scope of this 
book. 

So far we have considered abstract philosophical aspects in the defini- 
tion of random variables. We are now ready to develop the probabilistic 
properties of the defined random variables. 



3.3 Distributions of Random Variables 

3.3.1 Distributions 

Suppose we have a probability space $(\Omega, \mathcal{F}, P)$ with a random variable, $X$,
defined on the space. The random variable $X$ takes values on its range
space, which is some subset $A$ of $\Re$ (possibly $A = \Re$). The range space $A$ of
a random variable is often called the alphabet of the random variable. As
we have seen, since $X$ is a random variable, we know that all subsets of $\Omega$
of the form $X^{-1}(F) = \{\omega : X(\omega) \in F\}$, with $F \in \mathcal{B}(A)$, must be members
of $\mathcal{F}$ by definition. Thus the set function $P_X$ defined by

$$P_X(F) = P(X^{-1}(F)) = P(\{\omega : X(\omega) \in F\}); \quad F \in \mathcal{B}(A) \tag{3.22}$$

is well defined and assigns probabilities to output events involving the random
variable in terms of the original probability of input events in the original
experiment. The three written forms in equation (3.22) are all read
as $\Pr(X \in F)$ or “the probability that the random variable $X$ takes on a
value in $F$.” Furthermore, since inverse images preserve all set-theoretic
operations (see problem A.12), $P_X$ satisfies the axioms of probability as a
probability measure on $(A, \mathcal{B}(A))$: it is nonnegative, $P_X(A) = 1$, and it
is countably additive. Thus $P_X$ is a probability measure on the measurable
space $(A, \mathcal{B}(A))$. Therefore, given a probability space and a random variable
$X$, we have constructed a new probability space $(A, \mathcal{B}(A), P_X)$ where the
events describe outcomes of the random variable. The probability measure
$P_X$ is called the distribution of $X$ (as opposed to a “cumulative distribution
function” of $X$ to be introduced later).
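For a finite sample space the inverse image formula can be evaluated directly. The sketch below is a small illustration under assumed inputs: a fair die as the original experiment and the hypothetical random variable $X(\omega) = \omega \bmod 3$, neither of which comes from the text.

```python
from fractions import Fraction

# Assumed input experiment: a fair die with outcomes 1..6.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def X(w):
    """Illustrative random variable: remainder of the outcome modulo 3."""
    return w % 3

def distribution(F):
    """P_X(F) = P(X^{-1}(F)) = P({w : X(w) in F}), the inverse image formula."""
    return sum(p for w, p in P.items() if X(w) in F)

print(distribution({0}))        # 1/3
print(distribution({0, 1}))     # 2/3
print(distribution({0, 1, 2}))  # 1
```

The last line confirms that the induced measure assigns probability one to the whole alphabet $A = \{0, 1, 2\}$, as a probability measure must.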

If two random variables have the same distribution, then they are said to 
be equivalent since they have the same probabilistic description, whether 
or not they are defined on the same underlying space or have the same 
functional form (see problem 3.22). 

A substantial part of the application of probability theory to practical 
problems is devoted to determining the distributions of random variables, 
performing the “calculus of probability.” One begins with a probability 
space. A random variable is defined on that space. The distribution of the 
random variable is then derived, and this results in a new probability space. 
This topic is called variously “derived distributions” or “transformations of 
random variables” and is often developed in the literature as a sequence 
of apparently unrelated subjects. When the points in the original sample 
space can be interpreted as “signals,” then such problems can be viewed 
as “signal processing” and derived distribution problems are fundamental 
to the analysis of statistical signal processing systems. We shall emphasize 
that all such examples are just applications of the basic inverse image for- 
mula (3.22) and form a unified whole. In fact, this formula, with its vector 
analog, is one of the most important in applications of probability theory. 
Its specialization to discrete input spaces using sums and to continuous 
input spaces using integrals will be seen and used often throughout this 
book. 

It is useful to bear in mind both the mathematical and the intuitive 
concepts of a random variable when studying them. Mathematically, a 
random variable, say $X$, is a “nice” (= measurable) real-valued function
defined on the sample space of a probability space $(\Omega, \mathcal{F}, P)$. Intuitively, a
random variable is something that takes on values at random. The random- 
ness is described by a distribution Px, that is, by a probability measure on 
an event space of the real line. When doing computations involving random
variables, it is usually simpler to concentrate on the probability space
$(A, \mathcal{B}(A), P_X)$, where $A$ is the range space of $X$, than on the original
probability space $(\Omega, \mathcal{F}, P)$. Many experiments can yield equivalent random
variables, and the space $(A, \mathcal{B}(A), P_X)$ can be considered as a canonical
description of the random variable that is often more useful for computation.
The original space is important, however, for two reasons. First,
all distribution properties of random variables are inherited from the orig- 
inal space. Therefore much of the theory of random variables is just the 
theory of probability spaces specialized to the case of real sample spaces. 
If we understand probability spaces in general, then we understand ran- 
dom variables in particular. Second, and more important, we will often 
have many interrelated random variables defined on a common probability 
space. Because of the interrelationships, we cannot consider the random 
variables independently with separate probability spaces and distributions. 
We must refer to the original space in order to study the dependencies 
among the various random variables (or consider the random variables
jointly as a random vector).

Since a distribution is a special case of a probability measure, in many
cases it can be induced or described by a probability function, i.e., a pmf or
a pdf. If a range space of the random variable is discrete or, more generally,
if there is a discrete subset of the range space $A$ such that $P_X(A) = 1$, then
there is a pmf, say $p_X$, corresponding to the distribution $P_X$. The two are
related via the formulas

$$p_X(x) = P_X(\{x\}), \quad \text{all } x \in A, \tag{3.23}$$

where $A$ is the range space or alphabet of the random variable, and

$$P_X(F) = \sum_{x \in F} p_X(x); \quad F \in \mathcal{B}(A). \tag{3.24}$$

In (3.23) both quantities are read as $\Pr(X = x)$.

The pmf and the distribution imply each other from (3.23) and (3.24),
and hence either formula specifies the random variable.

If the range space of the random variable is continuous and if a pdf $f_X$
exists, then we can write the integral analog to (3.24):

$$P_X(F) = \int_F f_X(x)\, dx; \quad F \in \mathcal{B}(A). \tag{3.25}$$

There is no direct analog of (3.23) since a pdf is not a probability. An
approximate analog of (3.23) follows from the mean value theorem of calculus.
Suppose that $F = [x, x + \Delta x)$, where $\Delta x$ is extremely small. Then if $f_X$ is
sufficiently smooth, the mean value theorem implies that

$$P_X([x, x + \Delta x)) = \int_x^{x + \Delta x} f_X(\alpha)\, d\alpha \approx f_X(x)\, \Delta x, \tag{3.26}$$

so that if we multiply a pdf $f_X(x)$ by a differential $\Delta x$, it can be interpreted
as (approximately) the probability of being within $\Delta x$ of $x$. It is desirable,
however, to have an exact pair of results like (3.23) and (3.24) that show how
to go both ways, that is, to get the probability function from the distribution
as well as vice versa. From considerations of elementary calculus it seems
that we should somehow differentiate both sides of (3.25) to yield the pdf
in terms of the distribution. This is not immediately possible, however,
because $F$ is a set and not a real variable. Instead, to find a pdf from a
distribution, we use the intermediary of a cumulative distribution function
or cdf. We pause to give the formal definition.

Given a random variable $X$ with distribution $P_X$, the cumulative
distribution function or cdf $F_X$ is defined by

$$F_X(\alpha) = P_X((-\infty, \alpha]) = P_X(\{x : x \le \alpha\}); \quad \alpha \in \Re.$$

The cdf is seen to represent the cumulative probability of all values of
the random variable in the infinite interval from minus infinity up to and
including the real number argument of the cdf. The various forms can be
summarized as $F_X(\alpha) = \Pr(X \le \alpha)$. If the random variable $X$ is defined
on the probability space $(\Omega, \mathcal{F}, P)$, then by definition

$$F_X(\alpha) = P(X^{-1}((-\infty, \alpha])) = P(\{\omega : X(\omega) \le \alpha\}).$$

If a distribution possesses a pdf, then the cdf and pdf are related through
the distribution and (3.25) by

$$F_X(\alpha) = \int_{-\infty}^{\alpha} f_X(x)\, dx; \quad \alpha \in \Re. \tag{3.27}$$

The motivation for the definition of the cdf in terms of our previous
discussion is now obvious. Since integration and differentiation are mutually
inverse operations, the pdf is determined from the cdf (and hence the
distribution) by

$$f_X(\alpha) = \frac{dF_X(\alpha)}{d\alpha}; \quad \alpha \in \Re, \tag{3.28}$$

where, as is customary, the right-hand side is shorthand for

$$\left. \frac{dF_X(x)}{dx} \right|_{x = \alpha},$$

the derivative evaluated at $\alpha$. Alternatively, (3.28) also follows from the
fundamental theorem of calculus and the observation that

$$P_X((a, b]) = \int_a^b f_X(x)\, dx = F_X(b) - F_X(a). \tag{3.29}$$
Thus (3.27) and (3.28) together show how to find a pdf from a distribution
and hence provide the continuous analog of (3.23). Equation (3.28) is useful,
however, only if the derivative, and hence the pdf, exists. Observe that the
cdf is always well defined (because the semi-infinite interval is a Borel set
and therefore an event), regardless of whether or not the pdf exists, in both
the continuous and the discrete alphabet cases. For example, if $X$ is a
discrete alphabet random variable with alphabet $\mathcal{Z}$ and pmf $p_X$, then the
cdf is

$$F_X(x) = \sum_{k = -\infty}^{x} p_X(k), \tag{3.30}$$

the analogous sum to the integral of (3.27). Furthermore, for this example,
the pmf can be determined from the cdf (as well as the distribution) as

$$p_X(x) = F_X(x) - F_X(x - 1), \tag{3.31}$$

a difference analogous to the derivative of (3.28).
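The sum and difference relations between a pmf and its cdf are easy to confirm numerically. The sketch below assumes, purely for illustration, the geometric pmf $p_X(k) = (1-p)^{k-1}p$ on $k = 1, 2, \ldots$; the function names are ours.

```python
import math

def geometric_pmf(k, p):
    """Geometric pmf on k = 1, 2, ...; zero elsewhere."""
    return (1 - p) ** (k - 1) * p if k >= 1 else 0.0

def cdf(x, p):
    """F_X(x) as the cumulative sum of the pmf over all k <= x.
    The pmf vanishes for k < 1, so the sum starts at k = 1."""
    return sum(geometric_pmf(k, p) for k in range(1, x + 1))

def pmf_from_cdf(x, p):
    """Recover the pmf as the difference F_X(x) - F_X(x - 1)."""
    return cdf(x, p) - cdf(x - 1, p)

p = 0.3
for x in (1, 2, 5):
    print(x, geometric_pmf(x, p), pmf_from_cdf(x, p))
```

For this pmf the cumulative sum also has the closed form $1 - (1-p)^x$, which gives an independent check on `cdf`.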

It is desirable to use a single notation for the discrete and continuous
cases whenever possible. This is accomplished for expressing the distribution
in terms of the probability functions by using a Stieltjes integral, which
is defined as

$$P_X(F) = \int_F dF_X(x) = \int 1_F(x)\, dF_X(x) = \begin{cases} \sum_{x \in F} p_X(x) & \text{if $X$ is discrete} \\ \int_F f_X(x)\, dx & \text{if $X$ has a pdf.} \end{cases} \tag{3.32}$$

Thus (3.32) is a combination of both (3.24) and (3.25).

3.3.2 Mixture Distributions

More generally, we may have a random variable that has both discrete and
continuous aspects and hence is not describable by either a pmf alone or
a pdf alone. For example, we might have a probability space $(\Re, \mathcal{B}(\Re), P)$,
where $P$ is described by a Gaussian pdf $f(\omega)$; $\omega \in \Re$. The sample point $\omega \in \Re$
is input to a soft limiter with output $X(\omega)$, a device with input/output
characteristic $X$ defined by

$$X(\omega) = \begin{cases} -1 & \omega \le -1 \\ \omega & \omega \in (-1, 1) \\ 1 & 1 \le \omega. \end{cases} \tag{3.33}$$

As long as $|\omega| < 1$, $X(\omega) = \omega$. But for values outside this range, the output
is set equal to $-1$ or $+1$. Thus all of the probability density outside the
limiting range “piles up” on the ends so that $\Pr(X(\omega) = 1) = \int_{\omega \ge 1} f(\omega)\, d\omega$
is not zero. As a result $X$ will have a mixture distribution, described by a
pdf in $(-1, 1)$ and by a pmf at the points $\pm 1$.
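The probability that piles up at the end point $+1$ of the limiter can be computed and simulated. The sketch below assumes a zero-mean, unit-variance Gaussian input, which the text does not specify, and uses the standard library's complementary error function for the Gaussian tail.

```python
import math
import random

def soft_limiter(w):
    """Soft limiter: clip the input to the interval [-1, 1]."""
    return max(-1.0, min(1.0, w))

def gaussian_tail(a, sigma=1.0):
    """Pr(omega >= a) for a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * math.erfc(a / (sigma * math.sqrt(2.0)))

# Monte Carlo estimate of Pr(X = 1) under the assumed N(0, 1) input.
random.seed(0)
n = 200_000
hits = sum(1 for _ in range(n) if soft_limiter(random.gauss(0.0, 1.0)) == 1.0)

print(gaussian_tail(1.0))  # exact tail probability, roughly 0.1587
print(hits / n)            # simulated relative frequency
```

The simulated relative frequency of the event $\{X = 1\}$ should agree with the tail integral to within sampling error, which is what distinguishes this point mass from the zero probability a pure pdf would assign to a single point.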

Random variables of this type can be described by a distribution that is
the weighted sum of two other distributions: a discrete distribution and
a continuous distribution. The weighted sum is an example of a mixture
distribution, that is, a mixture of probability measures as in example [2.18].
Specifically, let $P_1$ be a discrete distribution with corresponding pmf $p$, and
let $P_2$ be a continuous distribution described by a pdf $f$. For any positive
weights $c_1, c_2$ with $c_1 + c_2 = 1$, the following mixture distribution $P_X$ is
defined:

$$\begin{aligned} P_X(F) &= c_1 P_1(F) + c_2 P_2(F) \\ &= c_1 \sum_{k \in F} p(k) + c_2 \int_F f(x)\, dx \\ &= c_1 \sum_k 1_F(k)\, p(k) + c_2 \int 1_F(x) f(x)\, dx; \quad F \in \mathcal{B}(\Re). \end{aligned} \tag{3.34}$$



For example, the output of the limiter of (3.33) has a pmf which places
probability one half on each of $\pm 1$ (taking the Gaussian input to have zero
mean), while the pdf is Gaussian-shaped for magnitudes less than unity
(i.e., it is a truncated Gaussian pdf normalized so that the pdf integrates
to one over the range $(-1, 1)$). Since $c_2$ weights the continuous part in
(3.34), the constant $c_2$ is the integral of the input Gaussian pdf over
$(-1, 1)$ and $c_1 = 1 - c_2$. Observe that the cdf for a random
variable with a mixture distribution is

$$\begin{aligned} F_X(\alpha) &= c_1 \sum_{k : k \le \alpha} p(k) + c_2 \int_{-\infty}^{\alpha} f(x)\, dx \\ &= c_1 F_1(\alpha) + c_2 F_2(\alpha), \end{aligned} \tag{3.35}$$

where $F_1$ and $F_2$ are the cdf’s corresponding to $P_1$ and $P_2$ respectively.
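The decomposition of the limiter output cdf into a discrete and a continuous piece can be checked numerically. The sketch below again assumes a zero-mean, unit-variance Gaussian input (our assumption) and compares the weighted sum $c_1 F_1 + c_2 F_2$ against the cdf computed directly from the limiter; all function names are illustrative.

```python
import math

def Phi(a):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

c2 = Phi(1.0) - Phi(-1.0)  # weight of the continuous piece: input mass in (-1, 1)
c1 = 1.0 - c2              # weight of the discrete piece: mass clipped to -1 or +1

def F1(a):
    """cdf of the discrete piece: probability 1/2 at each of -1 and +1."""
    return 0.5 * (a >= -1.0) + 0.5 * (a >= 1.0)

def F2(a):
    """cdf of the truncated Gaussian, normalized to integrate to one on (-1, 1)."""
    if a <= -1.0:
        return 0.0
    if a >= 1.0:
        return 1.0
    return (Phi(a) - Phi(-1.0)) / c2

def mixture_cdf(a):
    """Weighted sum of the discrete and continuous cdfs."""
    return c1 * F1(a) + c2 * F2(a)

def limiter_cdf(a):
    """Pr(X <= a) computed directly from the limiter characteristic."""
    if a < -1.0:
        return 0.0
    if a >= 1.0:
        return 1.0
    return Phi(a)

for a in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(a, mixture_cdf(a), limiter_cdf(a))
```

The two columns agree, including at the jump points $\pm 1$, which illustrates why the continuous weight must be the input probability of the interval $(-1, 1)$.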

The combined notation for discrete and continuous alphabets using the
Stieltjes integral notation of (3.32) also can be used as follows. Given a
random variable with a mixture distribution of the form (3.34), then

$$P_X(F) = \int_F dF_X(x) = \int 1_F(x)\, dF_X(x); \quad F \in \mathcal{B}(\Re), \tag{3.36}$$

where

$$\int 1_F(x)\, dF_X(x) = c_1 \sum_x 1_F(x)\, p(x) + c_2 \int 1_F(x) f(x)\, dx. \tag{3.37}$$

Observe that (3.36) and (3.37) include (3.32) as a special case where either
$c_1$ or $c_2$ is 0. Equations (3.36) and (3.37) provide a general means for
finding the distribution of a random variable $X$ given its cdf, provided the
distribution has the form of (3.35).

All random variables can be described by a cdf. But, more subtly, do
all random variables have a cdf of the form (3.35)? The answer is almost
yes. Certainly all of the random variables encountered in this course and
in engineering practice have this form. It can be shown, however, that the
most general cdf has the form of a mixture of three cdf’s: a continuous and
differentiable piece induced by a pdf, a discrete piece induced by a pmf, and
a third pathological piece. The third piece is an odd beast wherein the cdf
is something called a singular function: the cdf is continuous (it has no
jumps as it does in the discrete case), and the cdf is differentiable almost
everywhere (here “almost everywhere” means that the cdf is differentiable
at all points except some set $F$ for which $\int_F dx = 0$), but this derivative is 0
almost everywhere and hence it cannot be integrated to find a probability!
Thus for this third piece, one cannot use pmf’s or pdf’s to compute
probabilities. The construction of such a cdf is beyond the scope of this text, but
we can point out for the curious that the typical example involves placing
probability measures on the Cantor set that was considered in problem 2.18.
At any rate, as such examples almost never arise in practice, we shall ignore
them and henceforth consider only random variables for which (3.36) and
(3.37) hold.

While the general mixture distribution random variable has both dis- 
crete and continuous pieces, for pedagogical purposes it is usually simplest 
to treat the two pieces separately - i.e., to consider random variables that 
have either a pdf or a pmf. Hence we will rarely consider mixture distri- 
bution random variables and will almost always focus on those that are 
described either by a pmf or by a pdf and not both. 

To summarize our discussion, we will define a random variable to be a
discrete, continuous, or mixture random variable depending on whether
it is described probabilistically by a pmf, pdf, or mixture as in (3.36) and
(3.37) with $c_1, c_2 > 0$.

We note in passing that some texts endeavor to use a uniform approach 
to mixture distributions by permitting pdf’s to possess Dirac delta or im- 
pulse functions. The purpose of this approach is to permit the use of the 
continuous ideas in discrete cases, as in our limiter output example. If the 
cdf is differentiated, then a legitimate pdf results (without the need for a 
pmf) if a delta function is allowed at the two discontinuities of the cdf. 
As a general practice we prefer the Stieltjes notation, however, because 
of the added notational clumsiness resulting from using pdf’s to handle 
inherently discrete problems. For example, compare the notation for the
geometric pmf with the corresponding pdf that is written using Dirac delta
functions.

3.3.3 Derived Distributions 

[3.12] Let $(\Omega, \mathcal{F}, P)$ be a discrete probability space with $\Omega$ a discrete subset
of the real line and $\mathcal{F}$ the power set. Let $p$ be the pmf corresponding
to $P$, that is,

$$p(\omega) = P(\{\omega\}), \quad \text{all } \omega \in \Omega.$$

(Note: There is a very subtle possibility for confusion here. $p(\omega)$ could
be considered to be a random variable because it satisfies the definition
for a random variable. We do not use it in this sense, however;
we use it as a pmf for evaluating probabilities in the context given. In
addition, no confusion should result because we rarely use lower case
letters for random variables.) Let $X$ be a random variable defined on
this space. Since the domain of $X$ is discrete, its range space, $A$, is
also discrete (refer to the definition of a function to understand this
point). Thus the probability measure $P_X$ must also correspond to a
pmf, say $p_X$; that is, (3.23) and (3.24) must hold. Thus we can derive
either the distribution $P_X$ or the simpler pmf $p_X$ in order to complete
a probabilistic description of $X$. Using (3.22) yields

$$p_X(x) = P_X(\{x\}) = P(X^{-1}(\{x\})) = \sum_{\omega : X(\omega) = x} p(\omega). \tag{3.38}$$



Equation (3.38) provides a formula for computing the pmf and hence
the distribution of any random variable defined on a discrete probability
space. As a specific example, consider a discrete probability space $(\Omega, \mathcal{F}, P)$
with $\Omega = \mathcal{Z}_+$, $\mathcal{F}$ the power set of $\Omega$, and $P$ the probability measure induced
by the geometric pmf. Define a random variable $Y$ on this space by

$$Y(\omega) = \begin{cases} 1 & \text{if $\omega$ even} \\ 0 & \text{if $\omega$ odd,} \end{cases}$$

where we consider 0 (which has probability zero under the geometric pmf)
to be even. Thus we have a random variable $Y : \Omega \to \{0, 1\}$. Using the
formula (3.38) for the pmf for $Y(\omega) = 1$ results in

$$\begin{aligned} p_Y(1) &= \sum_{\omega : \omega \text{ even}} (1 - p)^{\omega - 1} p = \sum_{k = 2, 4, \ldots} (1 - p)^{k - 1} p \\ &= \frac{p}{1 - p} \sum_{k = 1}^{\infty} \left( (1 - p)^2 \right)^k = p(1 - p) \sum_{k = 0}^{\infty} \left( (1 - p)^2 \right)^k \\ &= p\, \frac{1 - p}{1 - (1 - p)^2} = \frac{1 - p}{2 - p}, \end{aligned}$$

where we have used the standard geometric series summation formula (in a
thinly disguised variation of an example of section 2.2.4). We can calculate
the remaining point in the pmf from the axioms of probability:
$p_Y(0) = 1 - p_Y(1)$. Thus we have found a non-obvious derived distribution
by computing a pmf via (3.38), a special case of (3.22). Of course,
given the pmf, we could now calculate the distribution from (3.24) for all
four sets in the power set of $\{0, 1\}$.
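The closed form for the probability of an even outcome can be sanity-checked by summing the geometric pmf over the even integers directly. This is a minimal numerical sketch; the function names and the truncation length are our choices.

```python
import math

def p_Y1_by_sum(p, terms=10_000):
    """Sum the geometric pmf (1-p)^(k-1) p over even k = 2, 4, ..., 2*terms,
    i.e. apply the discrete derived-distribution formula by brute force."""
    return sum((1 - p) ** (k - 1) * p for k in range(2, 2 * terms + 1, 2))

def p_Y1_closed(p):
    """Closed form (1 - p) / (2 - p) obtained from the geometric series."""
    return (1 - p) / (2 - p)

for p in (0.1, 0.3, 0.9):
    print(p, p_Y1_by_sum(p), p_Y1_closed(p))
```

Note that the result is always below one half, reflecting the fact that the odd outcome 1 is the most probable point under the geometric pmf.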

[3.13] Say we have a probability space $(\Re, \mathcal{B}(\Re), P)$ where $P$ is described
by a pdf $g$; that is, $g$ is a nonnegative function of the real line with
total integral 1 and

$$P(F) = \int_{r \in F} g(r)\, dr; \quad F \in \mathcal{B}(\Re).$$

Suppose that we have a random variable $X : \Re \to \Re$. We can use
(3.22) to write a general formula for the distribution of $X$:

$$P_X(F) = P(X^{-1}(F)) = \int_{r : X(r) \in F} g(r)\, dr.$$



Ideally, however, we would like to have a simpler description of $X$. In
particular, if $X$ is a “reasonable function” it should have either a discrete
range space (e.g., a quantizer) or a continuous range space (or possibly
both, as in the general mixture case). If the range space is discrete, then $X$
can be described by a pmf, and the preceding formula (with the requisite
change of dummy variable) becomes

$$p_X(x) = P_X(\{x\}) = \int_{r : X(r) = x} g(r)\, dr.$$

If, however, the range space is continuous, then there should exist a pdf
for $X$, say $f_X$, such that (3.25) holds. How do we find this pdf? As
previously discussed, to find a pdf from a distribution, we first find the cdf
$F_X$. Then we differentiate the cdf with respect to its argument to obtain
the pdf. As a nontrivial example, suppose that we have a probability space
$(\Re, \mathcal{B}(\Re), P)$ with $P$ the probability measure induced by the Gaussian pdf.
Define a random variable $W : \Re \to \Re$ by $W(r) = r^2$; $r \in \Re$. Following the
described procedure, we first attempt to find the cdf $F_W$ for $W$:

$$\begin{aligned} F_W(w) = \Pr(W \le w) &= P(\{\omega : W(\omega) = \omega^2 \le w\}) \\ &= P([-w^{1/2}, w^{1/2}]) \quad \text{if } w \ge 0. \end{aligned}$$
The cdf is clearly 0 if $w < 0$. Since $P$ is described by a pdf, say $g$ (the
specific Gaussian form is not yet important), then for $w \ge 0$

$$F_W(w) = \int_{-w^{1/2}}^{w^{1/2}} g(r)\, dr.$$

If one should now try to plug in the specific form for the Gaussian density,
one would quickly discover that no closed form solution exists. Happily,
however, the integral does not have to be evaluated explicitly: we need
only its derivative. Therefore we can use the following handy formula from
elementary calculus for differentiating the integral:

$$\frac{d}{dw} \int_{a(w)}^{b(w)} g(r)\, dr = g(b(w)) \frac{db(w)}{dw} - g(a(w)) \frac{da(w)}{dw}. \tag{3.39}$$

Application of the formula yields

$$f_W(w) = g(w^{1/2}) \left( \frac{w^{-1/2}}{2} \right) - g(-w^{1/2}) \left( -\frac{w^{-1/2}}{2} \right). \tag{3.40}$$



The final answer is found by plugging in the Gaussian form of $g$. For
simplicity we do this only for the special case where the mean $m = 0$. Then $g$ is
symmetric; that is, $g(w) = g(-w)$, so that

$$f_W(w) = w^{-1/2} g(w^{1/2}); \quad w \in [0, \infty),$$

and finally

$$f_W(w) = \frac{w^{-1/2}}{\sqrt{2\pi\sigma^2}}\, e^{-w/2\sigma^2}; \quad w \in [0, \infty).$$

This pdf is called a chi-squared pdf with one degree of freedom. Observe
that the functional form of the pdf is valid only for the given domain. By
implication the pdf is zero outside the given domain; in this example,
negative values of $W$ cannot occur. One should always specify the domain
of the dummy variable of a pdf; otherwise the description is incomplete.
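The derived chi-squared density can be checked against a numerical derivative of the cdf. The sketch below assumes a zero-mean Gaussian input with variance $\sigma^2$ and uses the fact that the probability of the interval $[-w^{1/2}, w^{1/2}]$ is then $\operatorname{erf}(w^{1/2}/(\sigma\sqrt{2}))$; the step size and function names are our choices.

```python
import math

def F_W(w, sigma=1.0):
    """cdf of W = r^2 for a zero-mean Gaussian input: P([-sqrt(w), sqrt(w)])."""
    if w < 0:
        return 0.0
    return math.erf(math.sqrt(w) / (sigma * math.sqrt(2.0)))

def f_W(w, sigma=1.0):
    """Derived pdf w^(-1/2) exp(-w / (2 sigma^2)) / sqrt(2 pi sigma^2), for w > 0."""
    return w ** -0.5 / math.sqrt(2.0 * math.pi * sigma ** 2) * math.exp(-w / (2.0 * sigma ** 2))

def numerical_derivative(F, w, h=1e-6):
    """Central-difference approximation to dF/dw."""
    return (F(w + h) - F(w - h)) / (2.0 * h)

for w in (0.5, 1.0, 2.5):
    print(w, f_W(w), numerical_derivative(F_W, w))
```

The comparison is made only at points $w > 0$, since the density is unbounded as $w$ approaches 0 even though it integrates to one.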

In practice one is likely to encounter the following trick for deriving
densities for certain simple one-dimensional problems. The approach can
be used whenever the random variable is a monotonic (increasing or
decreasing) function of its argument. Suppose first that we have a random
variable $Y = g(X)$, where $g$ is a monotonic increasing function and that
$g$ is differentiable. Since $g$ is monotonic, it is invertible and we can write
$X = g^{-1}(Y)$; that is, $x = g^{-1}(y)$ is the value of $x$ for which $g(x) = y$. Then

$$\begin{aligned} F_Y(y) &= \Pr(g(X) \le y) \\ &= \Pr(X \le g^{-1}(y)) \\ &= F_X(g^{-1}(y)) = \int_{-\infty}^{g^{-1}(y)} f_X(x)\, dx. \end{aligned}$$

From (3.39) the density can be found as

$$f_Y(y) = \frac{d}{dy} F_Y(y) = f_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy}.$$

A similar result can be derived for a monotone decreasing $g$ except that a
minus sign results. The final formula is that if $Y = g(X)$ and $g$ is monotone,
then

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right|. \tag{3.41}$$

This result is a one-dimensional special case of the so-called Jacobian
approach to derived distributions. The result could be used to solve the
previous problem by separately considering negative and nonnegative values
of the input $r$ since $r^2$ is a monotonic increasing function for nonnegative
$r$ and monotonic decreasing for negative $r$. As in this example, the direct
approach from the inverse image formula is often simpler than using the
Jacobian “shortcut,” unless one is dealing with a monotonic function.
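As a check on the monotone change-of-variables formula, including the absolute value, consider a decreasing map. The sketch below assumes, for illustration only, a standard Gaussian $X$ and $Y = e^{-X}$, so that $g^{-1}(y) = -\ln y$ and $|dg^{-1}(y)/dy| = 1/y$ for $y > 0$; it compares the formula with a numerical derivative of the directly computed cdf.

```python
import math

def f_X(x):
    """Standard Gaussian pdf (an assumed input density)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_Y(y):
    """Density of Y = exp(-X) from the monotone formula:
    f_X(g^{-1}(y)) * |d g^{-1}(y) / dy| with g^{-1}(y) = -ln y."""
    return f_X(-math.log(y)) / y

def F_Y(y):
    """Direct cdf: Pr(exp(-X) <= y) = Pr(X >= -ln y) = 1 - Phi(-ln y)."""
    return 1.0 - Phi(-math.log(y))

def numerical_derivative(F, y, h=1e-6):
    """Central-difference approximation to dF/dy."""
    return (F(y + h) - F(y - h)) / (2.0 * h)

for y in (0.5, 1.0, 3.0):
    print(y, f_Y(y), numerical_derivative(F_Y, y))
```

Without the absolute value the formula would return a negative "density" here, because $g^{-1}$ has a negative derivative for a decreasing $g$.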

It can be seen that although the details may vary from application 
to application, all derived distribution problems are solved by the general 
formula (3.22). In some cases the solution will result in a pmf; in others 
the solution will result in a pdf. 

To review the general philosophy, one uses the inverse image formula 
to compute the probability of an output event. This is accomplished by 
finding the probability with respect to the original probability measure of 
all input events that result in the given output event. In the discrete case 
one concentrates on output events of the form $X = x$ and thereby finds a
pmf. In the continuous case, one concentrates on output events of the form
$X \le x$ and thereby finds a cdf. The pdf is then found by differentiating.

[3.14] As a final example of derived distributions, suppose that we are
given a probability space $(\Omega, \mathcal{B}(\Omega), P)$ with $\Omega \subset \Re$. Define the identity
mapping $X : \Omega \to \Omega$ by $X(\omega) = \omega$. The identity mapping on the real
line with the Borel field is always a random variable because the
measurability requirement is automatically satisfied. Obviously the
distribution $P_X$ is identical to the original probability measure $P$.
Thus all probability spaces with real sample spaces provide examples
of random variables through the identity mapping. A random variable
described in this form instead of as a general function (not the identity
mapping) on an underlying probability space is called a “directly
given” random variable.



3.4 Random Vectors and Random Processes 

Thus far we have emphasized random variables, scalar functions on a sam- 
ple space that assume real values. In some cases we may wish to model 
processes or measurements with complex values. Complex outputs can be 
considered as two-dimensional real vectors with the components being the 
real and imaginary parts or, equivalently, the magnitude and phase. More 
generally, we may have $k$-dimensional real vector outputs. Given that a
random variable is a real-valued function of a sample space (with a technical
condition), that is, a function mapping a sample space into the real
line $\Re$, the obvious random vector definition is a vector-valued function
definition. Under this definition, a random vector is a vector of random
variables, a function mapping the sample space into $\Re^k$ instead of $\Re$. Yet
even more generally, we may have vectors that are not finite dimensional, 
e.g., sequences and waveforms whose values at each time are random vari- 
ables. This is essentially the definition of a random process. Fundamentally 
speaking, both random vectors and random processes are simply collections 
of random variables defined on a common probability space. 

Given a probability space $(\Omega, \mathcal{F}, P)$, a finite collection of random
variables $\{X_i;\ i = 0, 1, \ldots, k - 1\}$ is called a random vector. We will often
denote a random vector in boldface as $\mathbf{X}$. Thus a random vector is a
vector-valued function $\mathbf{X} : \Omega \to \Re^k$ defined by $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$
with each of the components being a random variable. It is also common
to use an ordinary $X$ and let context indicate whether $X$ has dimension 1 or
not. Another common notation for the $k$-dimensional random vector is $X^k$.
Each of these forms is convenient in different settings, but we begin with
the boldface notation in order to distinguish the new idea of random
vectors from the scalar case. As we progress, however, the non-boldface
notation will be used with increasing frequency to match current style. The
boldface notation is still found, but it is far less common than it used to be.
When vectors are used in linear algebra manipulations with matrices and
other vectors, we will assume that they are column vectors so that strictly
speaking the vector should be denoted $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})^t$, where $t$
denotes transpose.

A slightly different notation will ease the generalization to random
processes. A random vector $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ can be defined as an
indexed family of random variables $\{X_i;\ i \in \mathcal{T}\}$ where $\mathcal{T}$ is the index set
$\mathcal{Z}_k = \{0, 1, \ldots, k - 1\}$. The index set in some examples will correspond to
time; e.g., $X_i$ is a measurement on an experiment at time $i$ for $k$ different
times. We get a random process by using the same basic definition with
an infinite index set, which almost always corresponds to time. A random
process or stochastic process is an indexed family of random variables
$\{X_t;\ t \in \mathcal{T}\}$ or, equivalently, $\{X(t);\ t \in \mathcal{T}\}$, defined on a common probability
space $(\Omega, \mathcal{F}, P)$. The process is said to be discrete time if $\mathcal{T}$ is discrete,
e.g., $\mathcal{Z}_+$ or $\mathcal{Z}$, and continuous time if the index set is continuous, e.g., $\Re$ or
$[0, \infty)$. A discrete time random process is often called a time series. It is
said to be discrete alphabet or discrete amplitude if all finite-length random
vectors of random variables drawn from the random process are discrete
random vectors. The process is said to be continuous alphabet or continuous
amplitude if all finite-length random vectors of random variables drawn
from the random process are continuous random vectors. The process is
said to have a mixed alphabet if all finite-length random vectors of random
variables drawn from the random process are mixture random vectors.

Thus a random process is a collection of random variables indexed by
time, usually into the indefinite future and sometimes into the infinite past
as well. For each value of time $t$, $X_t$ or $X(t)$ is a random variable. Both
notations are used, but $X_t$ or $X_n$ is more common for discrete time processes
whereas $X(t)$ is more common for continuous time processes. It is useful to
recall that random variables are functions on an underlying sample space
$\Omega$, and hence implicitly depend on $\omega \in \Omega$. Thus a random process (and a
random vector) is actually a function of two arguments, written explicitly
as $X(t, \omega)$; $t \in \mathcal{T}$, $\omega \in \Omega$ (or $X_t(\omega)$; we use the first notation for the
moment). Observe that for a fixed value of time, $X(t, \omega)$ is a random
variable whose value depends probabilistically on $\omega$. On the other hand, if
we fix $\omega$ and allow $t$ to vary deterministically, we have either a sequence ($\mathcal{T}$
discrete) or a waveform ($\mathcal{T}$ continuous). If we fix both $t$ and $\omega$, we have a
number. Overall we can consider a random process as a two-space mapping
$X : \Omega \times \mathcal{T} \to \Re$ or as a one-space mapping $X : \Omega \to \Re^{\mathcal{T}}$ from sample space
into a space of sequences or waveforms.
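The two-argument view of a random process can be made concrete in code. The following sketch uses a hypothetical discrete time process, chosen by us for illustration: the sample space is $[0, 1)$ and $X(t, \omega)$ is the binary digit of $\omega$ at position $t + 1$. Fixing $\omega$ gives a sequence, fixing $t$ gives a random variable, and fixing both gives a number.

```python
def X(t, omega):
    """Value at time t = 0, 1, 2, ... of the process at sample point omega
    in [0, 1): the (t+1)-th digit of the binary expansion of omega."""
    return int(omega * 2 ** (t + 1)) % 2

# Fix omega and let t vary: a sample sequence of the process.
omega = 0.625                        # binary 0.101
sample_path = [X(t, omega) for t in range(5)]
print(sample_path)                   # [1, 0, 1, 0, 0]

# Fix t and let omega vary: a random variable, i.e. a function of omega.
X_2 = lambda w: X(2, w)
print(X_2(0.625), X_2(0.25))         # 1 0

# Fix both: a number.
print(X(0, 0.625))                   # 1
```

The same function thus serves as the mapping from $\Omega \times \mathcal{T}$ to the reals and, through `sample_path`, as the mapping from $\Omega$ into a space of sequences.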

There is a common notational ambiguity and hence confusion when 
dealing with random processes. It is the same problem we encountered 
with functions in the context of random variables at the beginning of the 
chapter. The notation $X(t)$ or $X_t$ usually means a sample of the random
process at a specified time $t$, i.e., a random variable, just as $\sin t$ means the
sine of a specified value $t$. Often in the literature, however, the notation is
used as an abbreviation for $\{X(t);\ t \in \mathcal{T}\}$ or $\{X_t;\ t \in \mathcal{T}\}$, that is, for the
entire random process or family of random variables. The abbreviation is
the same as the common use of $\sin t$ to mean $\{\sin t;\ t \in (-\infty, \infty)\}$, that is,
the entire waveform and not just a single value. In summary, the common
(and sometimes unfortunate) ambiguity is in whether or not the dummy
variable $t$ means a specific value or is implicitly allowed to vary over its
entire domain. Of course, as noted at the beginning of the chapter, the
problem could be avoided by reserving a different notation to specify a
fixed time value, say $t_0$, but this is usually not done to avoid a proliferation
of notation. In this book we will attempt to avoid the potential confusion by
using the abbreviations $\{X(t)\}$ and $\{X_t\}$ for the random processes when
the index set is clear from context and reserving the notation $X(t)$ and
$X_t$ to mean the $t$th random variable of the process, that is, the sample of
the random process at time t. The reader should beware in reading other 
sources, however, because this sloppiness will undoubtedly be encountered 
at some point in the literature; when this happens one can only hope that 
the context will make the meaning clear. 

There is also an ambiguity regarding the alphabet of the random process.
If $X(t)$ takes values in $A_t$, then strictly speaking the alphabet of
the random process is $\prod_{t \in \mathcal{T}} A_t$, the space of all possible waveforms or
sequences with coordinate $t$ taking values in $A_t$. If all of the $A_t$ are the same,
say $A_t = A$, this process alphabet is $A^{\mathcal{T}}$. In this case, however, the alphabet
of the process is commonly said to be simply $A$, the set of values from
which all of the coordinate random variables are drawn. We will frequently
use this convention.



3.5 Distributions of Random Vectors 

Since a random vector takes values in a space $\Re^k$, analogous to random
variables one might expect that the events in this space, that is, the members
of the event space $\mathcal{B}(\Re)^k$, should inherit a probability measure from
the original probability space. This is in fact true. Also analogous to the
case of a random variable, the probability measure is called a distribution
and is defined as

$$\begin{aligned} P_{\mathbf{X}}(F) &= P(\mathbf{X}^{-1}(F)) \\ &= P(\{\omega : \mathbf{X}(\omega) \in F\}) \\ &= P(\{\omega : (X_0(\omega), X_1(\omega), \ldots, X_{k-1}(\omega)) \in F\}), \quad F \in \mathcal{B}(\Re)^k, \end{aligned} \tag{3.42}$$

where the various forms are equivalent and all stand for $\Pr(\mathbf{X} \in F)$. Equation
(3.42) is the vector generalization of the inverse image equation (3.22)
for random variables. Hence (3.42) is the fundamental formula for deriving
vector distributions, that is, probability distributions describing random 
vector events. Keep in mind that the random vectors might be composed 
of a collection of samples from a random process. 

By definition the distribution given by (3.22) is valid for each component
random variable. This does not immediately imply, however, that
the distribution given by (3.42) for events on all components together is
valid. As in the case of a random variable, the distribution will be valid if
the output events $F \in \mathcal{B}(\Re)^k$ have inverse images under $\mathbf{X}$ that are input
events, that is, if $\mathbf{X}^{-1}(F) \in \mathcal{F}$ for every $F \in \mathcal{B}(\Re)^k$. The following subsection
treats this subtle issue in further detail, but the only crucial point for
our purposes is the following. Given that we consider real-valued vectors
$\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$, knowing that each coordinate $X_i$ is a random
variable (i.e., $X_i^{-1}(F) \in \mathcal{F}$ for each real event $F$) guarantees that $\mathbf{X}^{-1}(F) \in \mathcal{F}$
for every $F \in \mathcal{B}(\Re)^k$ and hence the basic derived distribution formula is
valid for random vectors.

3.5.1 ⋆Multidimensional Events

From the discussion following example [2.11] we can at least resolve the
issue for certain types of output events, viz., events that are rectangles.
Rectangles are special events in that the values assumed by any component
in the event are not constrained by any of the other components (compare
a two-dimensional rectangle with a circle, as in problem 2.31). Specifically,
$F \in \mathcal{B}(\Re)^k$ is a rectangle if it has the form

$$
F = \{\mathbf{x} : x_i \in F_i;\ i = 0, 1, \ldots, k-1\} = \bigcap_{i=0}^{k-1} \{\mathbf{x} : x_i \in F_i\} = \prod_{i=0}^{k-1} F_i ,
$$

where all $F_i \in \mathcal{B}(\Re)$; $i = 0, 1, \ldots, k-1$ (refer to Figure 2.3(d) for a
two-dimensional illustration of such a rectangle). Because inverse images preserve
set operations (A.12), the inverse image of $F$ can be specified as the
intersection of the inverse images of the individual events:

$$
\mathbf{X}^{-1}(F) = \{\omega : X_i(\omega) \in F_i;\ i = 0, 1, \ldots, k-1\} = \bigcap_{i=0}^{k-1} X_i^{-1}(F_i) .
$$

Since the $X_i$ are each random variables, the inverse images of the individual
events $X_i^{-1}(F_i)$ must all be in $\mathcal{F}$. Since $\mathcal{F}$ is an event space, the intersection
of events must also be an event, and hence $\mathbf{X}^{-1}(F)$ is indeed an event.

Thus we conclude that the distribution is well defined for rectangles.
As to more general output events, we simply observe that a result from
measure theory ensures that if (1) inverse images of rectangles are events
and (2) rectangles are used to generate the output event space, then the
inverse images of all output events are events. These two conditions are
satisfied by our definition. Thus the distribution of the random vector X
is well defined. Although a detailed proof of the measure theory result
will not be given, the essential concept can be given: any event in $\mathcal{B}(\Re)^k$ can
be approximated arbitrarily closely by finite unions of rectangles (e.g., a
circle can be approximated by lots of very small squares). The union of the
rectangles is an event. Finally, the limit of the events as the approximation
gets better must also be an event.

3.5.2 Multidimensional Probability Functions 

Given a probability space $(\Omega, \mathcal{F}, P)$ and a random vector $\mathbf{X} : \Omega \to \Re^k$, we
have seen that there is a probability measure $P_{\mathbf{X}}$ that the random vector
inherits from the original space. With the new probability measure we
define a new probability space $(\Re^k, \mathcal{B}(\Re)^k, P_{\mathbf{X}})$. As in the scalar case, the
distribution can be described by probability functions, that is, cdf’s and
either pmf’s or pdf’s (or both). If the random vector has a discrete range
space, then the distribution can be described by a multidimensional pmf
$p_{\mathbf{X}}(\mathbf{x}) = P_{\mathbf{X}}(\{\mathbf{x}\}) = \Pr(\mathbf{X} = \mathbf{x})$ as

$$
\begin{aligned}
P_{\mathbf{X}}(F) &= \sum_{\mathbf{x} \in F} p_{\mathbf{X}}(\mathbf{x}) \\
&= \sum_{(x_0, x_1, \ldots, x_{k-1}) \in F} p_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) ,
\end{aligned}
$$

where the last form points out the economy of the vector notation of the
previous line. If the random vector $\mathbf{X}$ has a continuous range space, then
in a similar fashion its distribution can be described by a multidimensional
pdf $f_{\mathbf{X}}$ with

$$
P_{\mathbf{X}}(F) = \int_F f_{\mathbf{X}}(\mathbf{x})\, d\mathbf{x} .
$$

In order to derive the pdf from the distribution, as in the scalar case, we
use a cdf.

Given a $k$-dimensional random vector $\mathbf{X}$, define its cumulative distribution
function $F_{\mathbf{X}}$ by

$$
\begin{aligned}
F_{\mathbf{X}}(\boldsymbol{\alpha}) &= F_{X_0, X_1, \ldots, X_{k-1}}(\alpha_0, \alpha_1, \ldots, \alpha_{k-1}) \\
&= P_{\mathbf{X}}(\{\mathbf{x} : x_i \le \alpha_i;\ i = 0, 1, \ldots, k-1\}) .
\end{aligned}
$$

In English, $F_{\mathbf{X}}(\mathbf{x}) = \Pr(X_i \le x_i;\ i = 0, 1, \ldots, k-1)$. Note that the cdf for
any value of its argument is the probability of a special kind of rectangle.
For example, if we have a two-dimensional random vector $(X, Y)$, then the
cdf $F_{X,Y}(\alpha, \beta) = \Pr(X \le \alpha, Y \le \beta)$ is the probability of the semi-infinite
rectangle $\{(x, y) : x \le \alpha, y \le \beta\}$.

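Observe that we can also write this probability in several other ways,
e.g.,

$$
\begin{aligned}
F_{\mathbf{X}}(\mathbf{x}) &= P_{\mathbf{X}}\left( \prod_{i=0}^{k-1} (-\infty, x_i] \right) \\
&= P(\{\omega : X_i(\omega) \le x_i;\ i = 0, 1, \ldots, k-1\}) \\
&= P\left( \bigcap_{i=0}^{k-1} X_i^{-1}((-\infty, x_i]) \right) .
\end{aligned}
$$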

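Since integration and differentiation are inverses of each other, it follows
that

$$
f_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) = \frac{\partial^k}{\partial x_0\, \partial x_1 \cdots \partial x_{k-1}} F_{X_0, X_1, \ldots, X_{k-1}}(x_0, x_1, \ldots, x_{k-1}) .
$$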



As with random variables, random vectors can, in general, have discrete
and continuous parts with a corresponding mixture distribution. We
will concentrate on random vectors that are described completely by either
pmf’s or pdf’s. Also as with random variables, we can always unify notation
using a multidimensional Stieltjes integral to write

$$
P_{\mathbf{X}}(F) = \int_F dF_{\mathbf{X}}(\mathbf{x}) ; \quad F \in \mathcal{B}(\Re)^k ,
$$

where the integral is defined as the usual integral if X is described by a 
pdf, as a sum if X is described by a pmf, and by a weighted average if 
X has both a discrete and a continuous part. Random vectors are said to 
be continuous, discrete, or mixture random vectors in accordance with the 
above analogy to random variables. 



3.5.3 Consistency of Joint and Marginal Distributions

By definition a random vector $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ is a collection of
random variables defined on a common probability space $(\Omega, \mathcal{F}, P)$. Alternatively,
$\mathbf{X}$ can be considered to be a random vector that takes on values
randomly as described by a probability distribution $P_{\mathbf{X}}$, without explicit
reference to the underlying probability space. Either the original probability
measure $P$ or the induced distribution $P_{\mathbf{X}}$ can be used to compute
probabilities of events involving the random vector. $P_{\mathbf{X}}$ in turn may be induced
by a pmf $p_{\mathbf{X}}$ or a pdf $f_{\mathbf{X}}$. From any of these probabilistic descriptions
we can find a probabilistic description for any of the component random
variables or any collection thereof. For example, given a value of $i$ in
$\{0, 1, \ldots, k-1\}$, the distribution of the random variable $X_i$ is found by
evaluating the distribution $P_{\mathbf{X}}$ for the random vector on one-dimensional
rectangles where only the component $X_i$ is constrained to lie in some set;
the rest of the components can take on any value. That is, $P_{\mathbf{X}}$ is evaluated
on rectangles of the form $\{\mathbf{x} = (x_0, \ldots, x_{k-1}) : x_i \in G\}$ for any $G \in \mathcal{B}(\Re)$
as

$$
P_{X_i}(G) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) , \quad G \in \mathcal{B}(\Re) . \tag{3.43}
$$

Of course the probability can also be evaluated using the underlying probability
measure $P$ via the usual formula

$$
P_{X_i}(G) = P(X_i^{-1}(G)) .
$$

Alternatively, we can consider this a derived distribution problem on
the vector probability space $(\Re^k, \mathcal{B}(\Re)^k, P_{\mathbf{X}})$ using a sampling function
$\Pi_i : \Re^k \to \Re$ as in example [3.4]. Specifically, let $\Pi_i(\mathbf{x}) = x_i$. Using (3.22) we
write

$$
P_{\Pi_i}(G) = P_{\mathbf{X}}(\Pi_i^{-1}(G)) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) . \tag{3.44}
$$

The two formulas (3.43) and (3.44) demonstrate that $\Pi_i$ and $X_i$ are equivalent
random variables, and indeed they correspond to the same physical
events: the outputs of the $i$th coordinate of the random vector $\mathbf{X}$. They
are related through the formula $\Pi_i(\mathbf{X}(\omega)) = X_i(\omega)$. Intuitively, the two
random variables provide different models of the same thing. As usual,
which is “better” depends on which is the simpler model to handle for a
given problem.

Another fundamental observation implicit in these ruminations is that
there are many ways to compute the probability of a given event such
as “the $i$th coordinate of the random vector $\mathbf{X}$ takes on a value in an
event $F$,” and all these methods must yield the same answer (assuming no
calculus errors) because they all can be referred back to a common definition
in terms of the underlying probability measure $P$. This is called
consistency; the various probability measures ($P$, $P_{X_i}$, and $P_{\mathbf{X}}$) are all
consistent in that they assign the same number to any given physical event
for which they all are defined. In particular, if we have a random process
$\{X_t; t \in \mathcal{T}\}$, then there is an infinite number of ways we could form
a random vector $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ by choosing a finite number $k$ and
sample times $t_0, t_1, \ldots, t_{k-1}$, and each of these would result in a corresponding
$k$-dimensional probability distribution $P_{X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}}$. The calculus
derived from the axioms of probability implies that all of these distributions
must be consistent in the same sense, i.e., all must yield the same answer
when used to compute the probability of a given event.

The distribution $P_{X_i}$ of a single component $X_i$ of a random vector $\mathbf{X}$
is referred to as a marginal distribution, while the distribution $P_{\mathbf{X}}$ of the
random vector is called a joint distribution. As we have seen, joint and
marginal distributions are related by consistency with respect to the original
probability measure, i.e.,

$$
P_{X_i}(G) = P_{\mathbf{X}}(\{\mathbf{x} : x_i \in G\}) = P(\{\omega : X_i(\omega) \in G\}) = \Pr(X_i \in G) . \tag{3.45}
$$

For the cases where the distributions are induced by pmf’s (marginal
pmf’s and joint pmf’s) or pdf’s (marginal pdf’s and joint pdf’s), the relation
becomes, respectively,

$$
p_{X_i}(\alpha) = \sum_{x_0, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{k-1}} p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1})
$$

or

$$
f_{X_i}(\alpha) = \int_{x_0, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{k-1}} f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_{k-1})\, dx_0 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_{k-1} .
$$

That is, one sums or integrates over all of the dummy variables corresponding
to the unwanted random variables in the vector to obtain the pmf or pdf
for the random variable $X_i$. The two formulas look identical except that
one sums for discrete random variables and the other integrates for continuous
ones. We repeat the fact that both formulas are simple consequences
of (3.45).

One can also use (3.43) to derive the cdf of $X_i$ by setting $G = (-\infty, \alpha]$.
The cdf is

$$
F_{X_i}(\alpha) = F_{\mathbf{X}}(\infty, \infty, \ldots, \infty, \alpha, \infty, \ldots, \infty) ,
$$

where the $\alpha$ appears in the $i$th position. This equation states that
$\Pr(X_i \le \alpha) = \Pr(X_i \le \alpha \text{ and } X_j < \infty, \text{ all } j \ne i)$. The expressions for pmf’s and
pdf’s also can be derived from the expression for cdf’s.

The details of notation with $k$ random variables can cloud the meaning
of the relations we are discussing. Therefore we rewrite them for the special
case of $k = 2$ to emphasize the essential form. Suppose that $(X, Y)$ is a
random vector. Then the marginal distribution of $X$ is obtained from
the joint distribution of $X$ and $Y$ by leaving $Y$ unconstrained, i.e., as in
equation (3.43):

$$
P_X(F) = P_{X,Y}(\{(x, y) : x \in F\}) ; \quad F \in \mathcal{B}(\Re) .
$$

Furthermore, the marginal cdf of $X$ is

$$
F_X(\alpha) = F_{X,Y}(\alpha, \infty) .
$$

If the range space of the vector $(X, Y)$ is discrete, the marginal pmf of $X$
is

$$
p_X(x) = \sum_y p_{X,Y}(x, y) .
$$

If the range space of the vector $(X, Y)$ is continuous and the cdf is differentiable,
the marginal pdf of $X$ is

$$
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy ,
$$

with similar expressions for the distribution and probability functions for
the random variable $Y$.

In summary, given a probabilistic description of a random vector, we 
can always determine a probabilistic description for any of the component 
random variables of the random vector. This follows from the consistency 
of probability distributions derived from a common underlying probabil- 
ity space. It is important to keep in mind that the opposite statement is 
not true. As considered in the introduction to this chapter, given all the 
marginal distributions of the component random variables, we cannot find 
the joint distribution of the random vector formed from the components 
unless we further constrain the problem. This is true because the marginal 
distributions provide none of the information about the interrelationships 
of the components that is contained in the joint distribution. 

In a similar manner we can deduce the distributions or probability functions
of “subvectors” of a random vector, that is, if we have the distribution
for $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ and if $k$ is big enough, we can find the distribution
for the random vector $(X_1, X_2)$ or the random vector $(X_5, X_{10}, X_{15})$,
and so on. Writing the general formulas in detail is, however, tedious and
adds little insight. The basic idea, however, is extremely important. One
always starts with a probability space $(\Omega, \mathcal{F}, P)$ from which one can proceed
in many ways to compute the probability of an event involving any
combination of random variables defined on the space. No matter how one
proceeds, however, the probability computed for a given event must be the
same. In other words, all joint and marginal probability distributions for
random variables on a common probability space must be consistent since
they all follow from the common underlying probability measure. For example,
after finding the distribution of a random vector $\mathbf{X}$, the marginal
distribution for the specific component $X_i$ can be found from the joint
distribution. This marginal distribution must agree with the marginal distribution
obtained for $X_i$ directly from the probability space. As another
possibility, one might first find a distribution for a subvector containing
$X_i$, say the vector $\mathbf{Y} = (X_{i-1}, X_i, X_{i+1})$. This distribution can be used to
find the marginal distribution for $X_i$. All answers must be the same since
all can be expressed in the form $P(\mathbf{X}^{-1}(F))$ using the original probability
space; that is, they must be consistent in the sense that they agree with one
another on events.

Examples: Marginals from Joint 

We now give examples of the computation of marginal probability functions 
from joint probability functions. 

[3.15] Say that we are given a pair of random variables X and Y such that
the random vector $(X, Y)$ has a pmf of the form

$$
p_{X,Y}(x, y) = r(x) q(y) ,
$$

where $r$ and $q$ are both valid pmf’s. In other words, $p_{X,Y}$ is a product
pmf. Then it is easily seen that

$$
\begin{aligned}
p_X(x) &= \sum_y p_{X,Y}(x, y) = \sum_y r(x) q(y) \\
&= r(x) \sum_y q(y) = r(x) .
\end{aligned}
$$

By the same argument, $p_Y(y) = q(y)$.
Thus in the special case of a product distribution, knowing the marginal
pmf’s is enough to know the joint distribution.

[3.16] Consider flipping two fair coins connected by a piece of rubber that
is fairly flexible. Unlike the example where the coins were soldered
together, it is not certain that they will show the same face; it is,
however, more probable. To quantify the pmf, say that the probability
of the pair (0,0) is .4, the probability of the pair (1,1) is .4, and the
probabilities of the pairs (0,1) and (1,0) are each .1. As with the
soldered-coins case, this is clearly not a product distribution, but a
simple computation shows that as in example [3.15], $p_X$ and $p_Y$ both
place probability 1/2 on each of 0 and 1. Thus this distribution, the
soldered-coins distribution, and the product distribution of example [3.15] all
yield the same marginal pmf’s! The point again is that the marginal
probability functions are not enough to describe a vector experiment;
we need the joint probability function to describe the interrelations
or dependencies among the random variables.
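
The comparison in example [3.16] can be checked numerically. The following Python sketch is an illustration added here, not part of the original development. It assumes that the product pmf uses fair marginals $r = q = (1/2, 1/2)$ and that the soldered coins show (0,0) or (1,1) with probability 1/2 each; the function name `marginals` is hypothetical.

```python
# Joint pmfs on {0, 1} x {0, 1}, stored as {(x, y): probability}.
# The product and soldered-coins values are assumptions for illustration.
product = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
soldered = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
rubber = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # example [3.16]


def marginals(joint):
    """Sum the joint pmf over the unwanted coordinate to get p_X and p_Y."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_x, p_y


for name, joint in [("product", product), ("soldered", soldered), ("rubber", rubber)]:
    p_x, p_y = marginals(joint)
    print(name, p_x, p_y)
```

All three joint pmf’s give marginals that place probability 1/2 on each face, even though the joint pmf’s themselves differ.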

[3.17] A gambler has a pair of very special dice: the sum of the two dice
comes up as seven on every roll. Each die has six faces with values
in $A = \{1, 2, 3, 4, 5, 6\}$. All combinations have equal probability; e.g.,
a one and a six has the same probability as a three
and a four. Although the two dice are identical, we will distinguish
between them by number for the purposes of assigning two random
variables. The outcome of the roll of the first die is denoted $X$ and
the outcome of the roll of the second die is called $Y$ so that $(X, Y)$ is
a random vector taking values in $A^2$, the space of all pairs of numbers
drawn from $A$. The joint pmf of $X$ and $Y$ is

$$
p_{X,Y}(x, y) = C , \quad x + y = 7 , \ (x, y) \in A^2 ,
$$

where $C$ is a constant to be determined. The pmf of $X$ is determined
by summing the pmf with respect to $y$. However, for any given $X = x \in A$,
the value of $Y$ is determined: viz., $Y = 7 - x$, so that $p_X(x) = C$ for
every $x \in A$. Since $p_X$ must sum to one over the six values in $A$, we have
$C = 1/6$. Therefore the pmf of $X$ is

$$
p_X(x) = 1/6 , \quad x \in A .
$$

Note that this pmf is the same as one would derive for the roll of a
single unbiased die! Note also that the pmf for $Y$ is identical with that for
$X$. Obviously, then, it is impossible to tell that the gambler is using unfair
dice as a pair from looking at outcomes of the rolls of each die alone. The
joint pmf cannot be deduced from the marginal pmf’s alone.

[3.18] Let $(X, Y)$ be a random vector with a pdf that is constant on the
unit disk in the XY plane; i.e.,

$$
f_{X,Y}(x, y) = C , \quad x^2 + y^2 \le 1 .
$$

The constant $C$ is determined by the requirement that the pdf integrate
to 1; i.e.,

$$
\iint_{x^2 + y^2 \le 1} C \, dx \, dy = 1 .
$$

Since this integral is just the area of a circle multiplied by $C$, we have
immediately that $C = 1/\pi$. For the moment, however, we leave the
joint pdf in terms of $C$ and determine the pdf of $X$ in terms of $C$ by
integrating with respect to $y$:

$$
f_X(x) = \int_{-(1-x^2)^{1/2}}^{+(1-x^2)^{1/2}} C \, dy = 2C (1 - x^2)^{1/2} , \quad x^2 \le 1 .
$$

Observe that we could now also find $C$ by a second integration:

$$
\int_{-1}^{+1} 2C (1 - x^2)^{1/2} \, dx = \pi C = 1 ,
$$

or $C = \pi^{-1}$. Thus the pdf of $X$ is

$$
f_X(x) = 2 \pi^{-1} (1 - x^2)^{1/2} , \quad x^2 \le 1 .
$$

By symmetry $Y$ has the same pdf. Note that the marginal pdf is
not constant, even though the joint pdf is. Furthermore, it is obvious
that it would be impossible to determine the joint density from the
marginal pdf’s alone.
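
As a rough numerical check of example [3.18], the sketch below integrates the uniform pdf on the unit disk over $y$ with a simple midpoint rule and compares the result with the closed-form marginal. It is an added illustration under stated assumptions: the helper names, the test point `x0 = 0.3`, and the grid size are arbitrary choices, not part of the text.

```python
import math

C = 1.0 / math.pi  # height of the uniform pdf on the unit disk


def joint_pdf(x, y):
    """Uniform pdf on the unit disk x^2 + y^2 <= 1."""
    return C if x * x + y * y <= 1.0 else 0.0


def marginal_closed_form(x):
    """f_X(x) = (2/pi) * sqrt(1 - x^2) for x^2 <= 1, as in example [3.18]."""
    return 2.0 * C * math.sqrt(1.0 - x * x) if abs(x) <= 1.0 else 0.0


def midpoint_integral(func, a, b, n=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


x0 = 0.3  # arbitrary test point, chosen only for illustration
numeric = midpoint_integral(lambda y: joint_pdf(x0, y), -1.0, 1.0)
total = midpoint_integral(marginal_closed_form, -1.0, 1.0)
print(numeric, marginal_closed_form(x0), total)
```

The first two printed values should agree closely, and the third should be close to 1, consistent with $\pi C = 1$.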

[3.19] Consider the two-dimensional Gaussian pdf of example [2.17] with
$k = 2$, $\mathbf{m} = (0, 0)^t$, and
$\Lambda = \{\lambda(i, j) : \lambda(1, 1) = \lambda(2, 2) = 1, \lambda(1, 2) = \lambda(2, 1) = \rho\}$.
Since the inverse matrix is

$$
\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} ,
$$

the joint pdf for the random vector $(X, Y)$ is

$$
f_{X,Y}(x, y) = \left( (2\pi)^2 (1 - \rho^2) \right)^{-1/2} e^{-\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)}} , \quad (x, y) \in \Re^2 .
$$

$\rho$ is called the “correlation coefficient” between $X$ and $Y$ and must
satisfy $\rho^2 < 1$ for $\Lambda$ to be positive definite. To find the pdf of $X$ we
complete the square in the exponent so that

$$
\begin{aligned}
f_{X,Y}(x, y) &= \left( (2\pi)^2 (1 - \rho^2) \right)^{-1/2} e^{-\frac{(y - \rho x)^2}{2(1 - \rho^2)} - \frac{x^2}{2}} \\
&= \left( 2\pi (1 - \rho^2) \right)^{-1/2} e^{-\frac{(y - \rho x)^2}{2(1 - \rho^2)}} \, (2\pi)^{-1/2} e^{-\frac{x^2}{2}} .
\end{aligned}
$$

The pdf of $X$ is determined by integrating with respect to $y$ on
$(-\infty, \infty)$. To perform this integration, refer to the form of the one-dimensional
Gaussian pdf with $m = \rho x$ (note that $x$ is fixed while the
integration is with respect to $y$) and $\sigma^2 = 1 - \rho^2$. The first factor in
the preceding equation has this form. Because the one-dimensional
pdf must integrate to one, the pdf of $X$ that results from integrating $y$
out from the two-dimensional pdf is also a one-dimensional Gaussian
pdf; i.e.,

$$
f_X(x) = (2\pi)^{-1/2} e^{-x^2/2} .
$$

As in examples [3.16], [3.17], and [3.18], $Y$ has the same pdf as $X$. Note
that by varying $\rho$ there is a whole family of joint Gaussian pdf’s with the
same marginal Gaussian pdf’s.
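
The claim that the marginal does not depend on the correlation coefficient can be checked numerically. The sketch below is an added illustration: it integrates the zero-mean, unit-variance joint Gaussian pdf over $y$ with a midpoint rule for several values of the correlation coefficient. The function names, the evaluation point 0.7, and the integration limits are assumptions made only for this example.

```python
import math


def joint_gaussian_pdf(x, y, rho):
    """Zero-mean, unit-variance bivariate Gaussian pdf with correlation rho."""
    norm = 2.0 * math.pi * math.sqrt(1.0 - rho * rho)
    quad = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))
    return math.exp(-quad) / norm


def standard_normal_pdf(x):
    """One-dimensional Gaussian pdf with mean 0 and variance 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def marginal_numeric(x, rho, lo=-12.0, hi=12.0, n=24000):
    """Integrate y out of the joint pdf with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(joint_gaussian_pdf(x, lo + (i + 0.5) * h, rho) for i in range(n))


for rho in (-0.9, 0.0, 0.5, 0.9):
    print(rho, marginal_numeric(0.7, rho), standard_normal_pdf(0.7))
```

For every value of the correlation coefficient tried, the numerically computed marginal should match the standard Gaussian pdf to within the accuracy of the integration.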



3.6 Independent Random Variables 

In chapter 2 it was seen that events are independent if the probability of a 
joint event can be written as a product of probabilities of individual events. 
The notion of independent events provides a corresponding notion of inde- 
pendent random variables and, as will be seen, results in random variables 
being independent if their joint distributions are product distributions. 

Two random variables $X$ and $Y$ defined on a probability space are independent
if the events $X^{-1}(F)$ and $Y^{-1}(G)$ are independent for all $F$ and
$G$ in $\mathcal{B}(\Re)$. A collection of random variables $\{X_i; i = 0, 1, \ldots, k-1\}$ is
said to be independent or mutually independent if all collections of events
of the form $\{X_i^{-1}(F_i); i = 0, 1, \ldots, k-1\}$ are mutually independent for
any $F_i \in \mathcal{B}(\Re)$; $i = 0, 1, \ldots, k-1$.

Thus two random variables are independent if and only if their output 
events correspond to independent input events. Translating this statement 
into distributions yields the following: 

Random variables $X$ and $Y$ are independent if and only if

$$
P_{X,Y}(F_1 \times F_2) = P_X(F_1) P_Y(F_2) , \quad \text{all } F_1, F_2 \in \mathcal{B}(\Re) .
$$

Recall that $F_1 \times F_2$ is an alternate notation for $\prod_{i=1}^{2} F_i$; we will
frequently use the alternate notation when the number of product events is
small. Note that a product and not an intersection is used here. The reader
should be certain that this is understood. The intersection is appropriate
if we refer back to the original $\omega$ events, that is, using the inverse image
formula to write this statement in terms of the underlying probability space
yields

$$
P(X^{-1}(F_1) \cap Y^{-1}(F_2)) = P(X^{-1}(F_1)) P(Y^{-1}(F_2)) .
$$

Random variables $X_0, \ldots, X_{k-1}$ are independent or mutually independent
if and only if

$$
P_{X_0, \ldots, X_{k-1}}\left( \prod_{i=0}^{k-1} F_i \right) = \prod_{i=0}^{k-1} P_{X_i}(F_i)
$$

for all $F_i \in \mathcal{B}(\Re)$; $i = 0, 1, \ldots, k-1$.

The general form for distributions can be specialized to pmf’s, pdf’s,
and cdf’s as follows.

Two discrete random variables $X$ and $Y$ are independent if and only if
the joint pmf factors as

$$
p_{X,Y}(x, y) = p_X(x) p_Y(y) ; \quad \text{all } x, y .
$$

A collection of discrete random variables $X_i$, $i = 0, 1, \ldots, k-1$ is mutually
independent if and only if the joint pmf factors as

$$
p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} p_{X_i}(x_i) ; \quad \text{all } x_i .
$$

Similarly, if the random variables are continuous and described by pdf’s,
then two random variables are independent if and only if the joint pdf
factors as

$$
f_{X,Y}(x, y) = f_X(x) f_Y(y) ; \quad \text{all } x, y \in \Re .
$$

A collection of continuous random variables is independent if and only if
the joint pdf factors as

$$
f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} f_{X_i}(x_i) .
$$

Two general random variables (discrete, continuous, or mixture) are
independent if and only if the joint cdf factors as

$$
F_{X,Y}(x, y) = F_X(x) F_Y(y) ; \quad \text{all } x, y \in \Re .
$$

A collection of general random variables is independent if and only if the
joint cdf factors as

$$
F_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) = \prod_{i=0}^{k-1} F_{X_i}(x_i) ; \quad \text{all } (x_0, x_1, \ldots, x_{k-1}) \in \Re^k .
$$

We have separately stated the two-dimensional case because of its sim- 
plicity and common occurrence. The student should be able to prove the 
equivalence of the general distribution form and the pmf form. If one does 
not consider technical problems regarding the interchange of limits of inte- 
gration, then the equivalence of the general form and the pdf form can also 
be proved. 
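
For discrete random variables the factorization test can be carried out directly. The sketch below is an added illustration, not the book’s method: it compares a joint pmf with the product of its marginals. The function names, the tolerance, and the particular pmf’s `r` and `q` are assumptions made for the example.

```python
def marginals(joint):
    """Sum the joint pmf {(x, y): prob} over the unwanted coordinate."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_x, p_y


def is_independent(joint, tol=1e-12):
    """True if p_{X,Y}(x, y) = p_X(x) p_Y(y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x in p_x
        for y in p_y
    )


# Hypothetical product pmf built from two marginal pmf's r and q.
r = {0: 0.3, 1: 0.7}
q = {0: 0.6, 1: 0.4}
product_pmf = {(x, y): r[x] * q[y] for x in r for y in q}

# Joint pmf of the rubber-connected coins of example [3.16].
rubber = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(is_independent(product_pmf), is_independent(rubber))
```

The product pmf passes the test, while the rubber-connected coins do not, since for them $p_{X,Y}(0, 0) = .4$ but $p_X(0) p_Y(0) = .25$.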

3.6.1 IID Random Vectors 

A random vector is said to be independent, identically distributed or iid
if the coordinate random variables are independent and identically distributed;
that is, if

• the distribution is a product distribution, i.e., it has the form

$$
P_{\mathbf{X}}\left( \prod_{i=0}^{k-1} F_i \right) = \prod_{i=0}^{k-1} P_{X_i}(F_i)
$$

for all choices of $F_i \in \mathcal{B}(\Re)$, $i = 0, 1, \ldots, k-1$, and

• all the marginal distributions are the same (the random variables are
all equivalent), i.e., there is a distribution $P_X$ such that
$P_{X_i}(F) = P_X(F)$; all $F \in \mathcal{B}(\Re)$ for all $i$.

For example, a random vector will have a product distribution if it has a
joint pdf or pmf that is a product pdf or pmf as described in example [2.16].
The general property is easy to describe in terms of probability functions.
The random vector will be iid if it has a joint pdf with the form

$$
f_{\mathbf{X}}(\mathbf{x}) = \prod_i f_X(x_i)
$$

for some pdf $f_X$ defined on $\Re$ or if it has a joint pmf with the form

$$
p_{\mathbf{X}}(\mathbf{x}) = \prod_i p_X(x_i)
$$

for some pmf $p_X$ defined on some discrete subset of the real line. Both of
these cases are included in the following statement: A random vector will
be iid if and only if its cdf has the form

$$
F_{\mathbf{X}}(\mathbf{x}) = \prod_i F_X(x_i)
$$

for some cdf $F_X$.

Note that, in contrast with earlier examples, the specification “product 
distribution,” along with the marginal pdf’s or pmf’s or cdf’s, is sufficient 
to specify the joint distribution. 

3.7 Conditional Distributions 

The idea of conditional probability can be used to provide a general rep- 
resentation of a joint distribution as a product, but a more complicated 
product than arises with an iid vector. As one would hope, the compli- 
cated form reduces to the simpler form when the vector is in fact iid. The 
individual terms of the product have useful interpretations. 

The use of conditional probabilities allows us to break up many problems 
in a convenient form and focus on the relations among random variables. 
Examples to be treated include statistical detection, statistical classification,
and additive noise.

3.7.1 Discrete Conditional Distributions

We begin with the discrete alphabet case as elementary conditional proba- 
bility suffices in this simple case. We can derive results that appear similar 
for the continuous case, but nonelementary conditional probability will be 
required to interpret the results correctly. 

Begin with the simple case of a discrete random vector $(X, Y)$ with
alphabet $A_X \times A_Y$ described by a pmf $p_{X,Y}(x, y)$. Let $p_X$ and $p_Y$ denote the
corresponding marginal pmf’s. Define for each $x \in A_X$ for which $p_X(x) > 0$
the conditional pmf $p_{Y|X}(y|x)$; $y \in A_Y$ as the elementary conditional
probability of $Y = y$ given $X = x$, that is,

$$
\begin{aligned}
p_{Y|X}(y|x) &= P(Y = y \mid X = x) \\
&= \frac{P(Y = y \text{ and } X = x)}{P(X = x)} \\
&= \frac{P(\{\omega : Y(\omega) = y\} \cap \{\omega : X(\omega) = x\})}{P(\{\omega : X(\omega) = x\})} \\
&= \frac{p_{X,Y}(x, y)}{p_X(x)} ,
\end{aligned}
\tag{3.46}
$$



where we have assumed that $p_X(x) > 0$ for all suitable $x$ to avoid dividing by
0. Thus a conditional pmf is just a special case of an elementary conditional
probability. For each $x$ a conditional pmf is itself a pmf, since it is clearly
nonnegative and sums to 1:

$$
\begin{aligned}
\sum_{y \in A_Y} p_{Y|X}(y|x) &= \sum_{y \in A_Y} \frac{p_{X,Y}(x, y)}{p_X(x)} \\
&= \frac{1}{p_X(x)} \sum_{y \in A_Y} p_{X,Y}(x, y) \\
&= \frac{1}{p_X(x)} p_X(x) = 1 .
\end{aligned}
$$



We can compute conditional probabilities by summing conditional pmf’s,
i.e.,

$$
P(Y \in F \mid X = x) = \sum_{y \in F} p_{Y|X}(y|x) . \tag{3.47}
$$

The joint probability can be expressed as a product as

$$
p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x) . \tag{3.48}
$$

Unlike the independent case, the terms of the product do not each depend
on only a single independent variable. If $X$ and $Y$ are independent,
then $p_{Y|X}(y|x) = p_Y(y)$ and the joint pmf reduces to the product of two
marginals.

Given the conditional pmf $p_{Y|X}$ and the pmf $p_X$, the conditional pmf
with the roles of the two random variables reversed can be computed as

$$
p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_{Y|X}(y|x) p_X(x)}{\sum_u p_{Y|X}(y|u) p_X(u)} , \tag{3.49}
$$

a result often referred to as Bayes’ rule.
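
The mechanics of (3.46) and (3.49) can be illustrated with a small computation. The sketch below is an added example: it reuses the joint pmf of example [3.16], forms the conditional pmf from the joint and marginal pmf’s, and then reverses the conditioning with Bayes’ rule. The dictionary layout and the function name `bayes` are hypothetical choices.

```python
# Joint pmf of example [3.16], stored as {(x, y): probability}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
values = (0, 1)

# Marginal pmf of X: sum the joint pmf over y.
p_x = {x: sum(joint[(x, y)] for y in values) for x in values}

# Conditional pmf p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x), keyed by (y, x).
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in values for y in values}


def bayes(p_y_given_x, p_x, x, y):
    """p_{X|Y}(x|y) computed from p_{Y|X} and p_X as in (3.49)."""
    denom = sum(p_y_given_x[(y, u)] * p_x[u] for u in p_x)
    return p_y_given_x[(y, x)] * p_x[x] / denom


for x in values:
    # For each fixed x the conditional pmf sums to one over y.
    print(x, sum(p_y_given_x[(y, x)] for y in values))

print(bayes(p_y_given_x, p_x, 1, 1))
```

Here Bayes’ rule gives $p_{X|Y}(1|1) = (.8)(.5)/(.5) = .8$, which agrees with computing $p_{X,Y}(1, 1)/p_Y(1) = .4/.5$ directly.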

The ideas of conditional pmf’s immediately extend to random vectors.
Suppose we have a random vector $(X_0, X_1, \ldots, X_{k-1})$ with a pmf
$p_{X_0, X_1, \ldots, X_{k-1}}$; then (provided none of the denominators are 0) we can define
for each $l = 1, 2, \ldots, k-1$ the conditional pmf’s

$$
p_{X_l | X_0, \ldots, X_{l-1}}(x_l | x_0, \ldots, x_{l-1}) = \frac{p_{X_0, \ldots, X_l}(x_0, \ldots, x_l)}{p_{X_0, \ldots, X_{l-1}}(x_0, \ldots, x_{l-1})} . \tag{3.50}
$$

Then simple algebra leads to the chain rule for pmf’s:

$$
\begin{aligned}
p_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1})
&= \left( \frac{p_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1})}{p_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2})} \right) p_{X_0, X_1, \ldots, X_{n-2}}(x_0, x_1, \ldots, x_{n-2}) \\
&\ \ \vdots \\
&= p_{X_0}(x_0) \prod_{l=1}^{n-1} \frac{p_{X_0, X_1, \ldots, X_l}(x_0, x_1, \ldots, x_l)}{p_{X_0, X_1, \ldots, X_{l-1}}(x_0, x_1, \ldots, x_{l-1})} \\
&= p_{X_0}(x_0) \prod_{l=1}^{n-1} p_{X_l | X_0, \ldots, X_{l-1}}(x_l | x_0, \ldots, x_{l-1}) ,
\end{aligned}
\tag{3.51}
$$

a product of conditional probabilities. This provides a general form of the
iid product form and reduces to that product form if indeed the random
variables are mutually independent. This formula plays an important role
in characterizing the memory in random vectors and processes. Since it
can be used to construct joint pmf’s, it can also be used to specify a random
process.
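
The chain rule (3.51) can be verified on a small example. The sketch below is an added illustration under stated assumptions: the joint pmf on binary triples is a hypothetical one built from arbitrary positive weights, and the helper names `prefix_pmf` and `chain_rule` are not from the text.

```python
from itertools import product as cartesian

# Hypothetical joint pmf on {0, 1}^3 built from arbitrary positive weights.
weights = {
    bits: 1 + bits[0] + 2 * bits[1] * bits[2] + bits[0] * bits[2]
    for bits in cartesian((0, 1), repeat=3)
}
total = sum(weights.values())
joint = {bits: w / total for bits, w in weights.items()}


def prefix_pmf(joint, prefix):
    """pmf of the leading coordinates: sum the joint pmf over the rest."""
    return sum(p for bits, p in joint.items() if bits[:len(prefix)] == prefix)


def chain_rule(joint, bits):
    """Right-hand side of (3.51): p_{X_0}(x_0) times the conditional pmf's."""
    value = prefix_pmf(joint, bits[:1])
    for l in range(1, len(bits)):
        # Conditional pmf (3.50) written as a ratio of prefix pmf's.
        value *= prefix_pmf(joint, bits[:l + 1]) / prefix_pmf(joint, bits[:l])
    return value


for bits in sorted(joint):
    print(bits, joint[bits], chain_rule(joint, bits))
```

Each row prints the joint pmf next to the chain-rule product; the two columns should agree up to rounding because the intermediate ratios telescope.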



3.7.2 Continuous Conditional Distributions 

The situation with continuous random vectors is more complicated if rigor
is required, but the mechanics are quite similar. Again begin with the
simple case of two random variables $X$ and $Y$ with a joint distribution,
now taken to be described by a pdf $f_{X,Y}$. We define the conditional pdf as
an exact analog to that for pmf’s:

$$
f_{Y|X}(y|x) \triangleq \frac{f_{X,Y}(x, y)}{f_X(x)} . \tag{3.52}
$$

This looks the same as the pmf, but it is not the same because pmf’s
are probabilities and pdf’s are not. A conditional pmf is an elementary
conditional probability. A conditional pdf is not. It is also not the same as
the conditional pdf of example [2.19] as in that case the conditioning event
had nonzero probability. The conditional pdf $f_{Y|X}$ can, however, be related
to a probability in the same way an ordinary pdf (and the conditional pdf
of example [2.19]) can. An ordinary pdf is a density of probability; it is
integrated to compute a probability. In the same way, a conditional pdf
can be interpreted as a density of conditional probability, something you
integrate to get a conditional probability. Now, however, the conditioning
event can have probability zero and this does not really fit into the previous
development of elementary conditional probability. Note that a conditional
pdf is indeed a pdf, a nonnegative function that integrates to one. This
follows from

$$
\int f_{Y|X}(y|x)\, dy = \int \frac{f_{X,Y}(x, y)}{f_X(x)}\, dy = \frac{1}{f_X(x)} \int f_{X,Y}(x, y)\, dy = \frac{1}{f_X(x)} f_X(x) = 1 ,
$$

provided we require that $f_X(x) > 0$.

To be more specific, given a conditional pdf $f_{Y|X}$, we make the tentative
definition that the (nonelementary) conditional probability that $Y \in F$
given $X = x$ is

$$
P(Y \in F \mid X = x) = \int_F f_{Y|X}(y|x)\, dy . \tag{3.53}
$$

Note the close resemblance to the elementary conditional probability for- 
mula in terms of conditional pmf’s of (3.47). For all practical purposes 
(and hence for virtually all of this book), this constructive definition of 
nonelementary conditional probability will suffice. Unfortunately it does 
not provide sufficient rigor to lead to a useful advanced theory. Section 3.17 
discusses the problems and the correct general definition in some depth, but 
it is not required for most applications.

Via almost identical manipulations to the pmf case in (3.49), conditional
pdf’s satisfy a Bayes’ rule:

$$
f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_{Y|X}(y|x) f_X(x)}{\int f_{Y|X}(y|u) f_X(u)\, du} . \tag{3.54}
$$

As a simple but informative example of a conditional pdf, consider a
generalization of Example [3.19] to the case of a two-dimensional vector
$\mathbf{U} = (X, Y)$ with a Gaussian pdf having a mean vector $(m_X, m_Y)^t$ and a
covariance matrix

$$
\Lambda = \begin{bmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{bmatrix} , \tag{3.55}
$$



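where $\rho$ is called the correlation coefficient of $X$ and $Y$. Straightforward
algebra yields

$$
\det(\Lambda) = \sigma_X^2 \sigma_Y^2 (1 - \rho^2) \tag{3.56}
$$

$$
\Lambda^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} \frac{1}{\sigma_X^2} & -\frac{\rho}{\sigma_X \sigma_Y} \\ -\frac{\rho}{\sigma_X \sigma_Y} & \frac{1}{\sigma_Y^2} \end{bmatrix} \tag{3.57}
$$

so that the two-dimensional pdf becomes

$$
\begin{aligned}
f_{X,Y}(x, y) &= \frac{1}{2\pi \sqrt{\det \Lambda}} e^{-\frac{1}{2} (x - m_X, y - m_Y) \Lambda^{-1} (x - m_X, y - m_Y)^t} \\
&= \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{x - m_X}{\sigma_X} \right)^2 - 2\rho \frac{(x - m_X)(y - m_Y)}{\sigma_X \sigma_Y} + \left( \frac{y - m_Y}{\sigma_Y} \right)^2 \right] \right) .
\end{aligned}
\tag{3.58}
$$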




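A little algebra to rearrange the expression yields

$$f_{XY}(x,y) = \left[\frac{1}{\sigma_X\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-m_X}{\sigma_X}\right)^2}\right] \times \left[\frac{1}{\sigma_Y\sqrt{1-\rho^2}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y-m_Y-\rho(\sigma_Y/\sigma_X)(x-m_X)}{\sqrt{1-\rho^2}\,\sigma_Y}\right)^2}\right] \tag{3.60}$$

from which it follows immediately that the conditional pdf is

$$f_{Y|X}(y|x) = \frac{1}{\sigma_Y\sqrt{1-\rho^2}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y-m_Y-\rho(\sigma_Y/\sigma_X)(x-m_X)}{\sqrt{1-\rho^2}\,\sigma_Y}\right)^2}, \tag{3.61}$$

which is itself a Gaussian density with variance $\sigma^2_{Y|X} = \sigma_Y^2(1-\rho^2)$ and
mean $m_{Y|X} = m_Y + \rho(\sigma_Y/\sigma_X)(x - m_X)$. Integrating $y$ out of the joint
pdf then shows that as in Example [3.19] the marginal pdf is also Gaussian:

$$f_X(x) = \frac{1}{\sigma_X\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-m_X}{\sigma_X}\right)^2}. \tag{3.62}$$

A similar argument shows that $f_Y(y)$ and $f_{X|Y}(x|y)$ are also Gaussian
pdf’s. Observe that if $X$ and $Y$ are jointly Gaussian, then they are also
both individually and conditionally Gaussian!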
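The factorization of the joint Gaussian pdf into a marginal times a conditional can be checked numerically. The following is a minimal sketch, not part of the original development: the function names and the parameter values are illustrative assumptions. It evaluates the joint pdf divided by the marginal pdf of $X$ and compares the result with a Gaussian pdf having the conditional mean and variance stated above.

```python
import math

def gauss_pdf(x, mean, var):
    """Scalar Gaussian pdf with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_pdf(x, y, m_x, m_y, s_x, s_y, rho):
    """Two-dimensional Gaussian pdf written in the scalar form of the text."""
    a = (x - m_x) / s_x
    b = (y - m_y) / s_y
    quad = (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * s_x * s_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm

def conditional_params(x, m_x, m_y, s_x, s_y, rho):
    """Conditional mean and variance of Y given X = x."""
    mean = m_y + rho * (s_y / s_x) * (x - m_x)
    var = s_y ** 2 * (1.0 - rho ** 2)
    return mean, var

# Illustrative (hypothetical) parameter values and evaluation point.
m_x, m_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, 0.6
x, y = 0.3, -1.7

# Conditional pdf from its definition: joint divided by marginal.
ratio = joint_pdf(x, y, m_x, m_y, s_x, s_y, rho) / gauss_pdf(x, m_x, s_x ** 2)

# Conditional pdf from the closed-form mean and variance.
cond_mean, cond_var = conditional_params(x, m_x, m_y, s_x, s_y, rho)
direct = gauss_pdf(y, cond_mean, cond_var)

print(ratio, direct)
```

The two printed numbers should agree to within floating-point rounding for any admissible parameters with $|\rho| < 1$.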



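A chain rule for pdf’s follows in exactly the same way as that for pmf’s.
Assuming $f_{X_0,X_1,\ldots,X_i}(x_0,x_1,\ldots,x_i) > 0$,

$$\begin{aligned}
f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})
&= \frac{f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})}{f_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2})}\, f_{X_0,X_1,\ldots,X_{n-2}}(x_0,x_1,\ldots,x_{n-2}) \\
&\;\;\vdots \\
&= f_{X_0}(x_0)\prod_{i=1}^{n-1} \frac{f_{X_0,X_1,\ldots,X_i}(x_0,x_1,\ldots,x_i)}{f_{X_0,X_1,\ldots,X_{i-1}}(x_0,x_1,\ldots,x_{i-1})} \\
&= f_{X_0}(x_0)\prod_{i=1}^{n-1} f_{X_i|X_0,\ldots,X_{i-1}}(x_i|x_0,\ldots,x_{i-1}).
\end{aligned} \tag{3.63}$$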



3.8 Statistical Detection and Classification 

As a simple, but nonetheless very important, example of the application of
conditional probability mass functions describing discrete random vectors,
suppose that $X$ is a binary random variable described by a pmf $p_X$, with
$p_X(1) = p$, possibly one bit in some data coming through a modem. You
receive a random variable $Y$, which is equal to $X$ with probability $1 - \epsilon$.
In terms of a conditional pmf this is

$$p_{Y|X}(y|x) = \begin{cases} \epsilon & x \ne y \\ 1 - \epsilon & x = y. \end{cases} \tag{3.64}$$

This can be written in a simple form using the idea of modulo 2 (or mod 2)
arithmetic, which will often be useful when dealing with binary variables.
Modulo 2 arithmetic or the “Galois field of 2 elements” arithmetic consists
of an operation $\oplus$ defined on the binary alphabet $\{0,1\}$ as follows: Define
modulo 2 addition $\oplus$ by

$$0 \oplus 1 = 1 \oplus 0 = 1 \tag{3.65}$$

$$0 \oplus 0 = 1 \oplus 1 = 0. \tag{3.66}$$

The operation $\oplus$ corresponds to an “exclusive or” in logic; that is, it
produces a 1 if one or the other but not both of its arguments is 1. Modulo
2 addition can also be thought of as a parity check, producing a 1 if there
is an odd number of 1’s being summed and a 0 otherwise. An equivalent
definition for the conditional pmf is

$$p_{Y|X}(y|x) = \epsilon^{x \oplus y}(1-\epsilon)^{1 - x \oplus y}. \tag{3.67}$$

For example, the channel over which the bit is being sent is noisy in that it
occasionally makes an error. Suppose that the probability of such an error
is known to be $\epsilon$. The error probability might be very small on a good
phone line, but it might be very large if an evil hacker is trying to corrupt
your data. Given the observed $Y$, what is the best guess $\hat{X}(Y)$ of what is
actually sent? In other words, what is the best decision rule or detection
rule for guessing the value of $X$ given the observed value of $Y$? A reasonable
criterion for judging the quality of an arbitrary rule $\hat{X}$ is the resulting
probability of error

$$P_e(\hat{X}) = \Pr(\hat{X}(Y) \ne X). \tag{3.68}$$

A decision rule is optimal if it yields the smallest possible probability of
error over all possible decision rules. A little probability manipulation quickly
yields the optimal decision rule. Instead of minimizing the error probability,
we maximize the probability of being correct:

$$\begin{aligned}
\Pr(\hat{X} = X) &= 1 - P_e(\hat{X}) \\
&= \sum_{(x,y):\hat{X}(y)=x} p_{X,Y}(x,y) \\
&= \sum_{(x,y):\hat{X}(y)=x} p_{X|Y}(x|y)\,p_Y(y) \\
&= \sum_y p_Y(y)\left(\sum_{x:\hat{X}(y)=x} p_{X|Y}(x|y)\right) \\
&= \sum_y p_Y(y)\,p_{X|Y}(\hat{X}(y)|y).
\end{aligned}$$

To maximize this sum, we want to maximize the terms within the sum
for each $y$. Clearly the maximum value of the conditional probability
$p_{X|Y}(\hat{X}(y)|y)$, $\max_u p_{X|Y}(u|y)$, will be achieved if we define the decision
rule $\hat{X}(y)$ to be the value of $u$ achieving the maximum of $p_{X|Y}(u|y)$ over $u$,
that is, define $\hat{X}(y)$ to be $\arg\max_u p_{X|Y}(u|y)$ (also denoted $\max_u^{-1} p_{X|Y}(u|y)$).

In words: the optimal estimate of $X$ given the observation $Y$ in the sense
of minimizing the probability of error is the most probable value of $X$ given
the observation. This is called the maximum a posteriori or MAP decision
rule. In our binary example it reduces to choosing $\hat{x} = y$ if $\epsilon < 1/2$ and
$\hat{x} = 1 - y$ if $\epsilon > 1/2$. If $\epsilon = 1/2$ you can give up and flip a coin or make an
arbitrary decision. (Why?) Thus the minimum (optimal) error probability
over all possible rules is $\min(\epsilon, 1 - \epsilon)$.
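The MAP rule for the binary channel can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the check of the reduction to "guess $y$ when $\epsilon < 1/2$" is made for an equiprobable input, $p = 1/2$, a value chosen here for the example rather than taken from the text.

```python
import math

def bsc_posterior(y, prior_one, eps):
    """Return [p_{X|Y}(0|y), p_{X|Y}(1|y)] for the binary channel of (3.64)."""
    prior = [1.0 - prior_one, prior_one]
    # Joint pmf p_X(x) p_{Y|X}(y|x) with p_{Y|X} = eps if x != y, else 1 - eps.
    joint = [prior[x] * (eps if x != y else 1.0 - eps) for x in (0, 1)]
    p_y = sum(joint)
    return [j / p_y for j in joint]

def map_rule(y, prior_one, eps):
    """MAP decision: the x with the largest posterior probability (ties go to 0)."""
    post = bsc_posterior(y, prior_one, eps)
    return 0 if post[0] >= post[1] else 1

def error_probability(prior_one, eps):
    """Probability of error of the MAP rule, computed as 1 - Pr(correct)."""
    prior = [1.0 - prior_one, prior_one]
    p_correct = 0.0
    for y in (0, 1):
        x_hat = map_rule(y, prior_one, eps)
        p_correct += prior[x_hat] * (eps if x_hat != y else 1.0 - eps)
    return 1.0 - p_correct

# Equiprobable input (an assumption of this example), good and bad channels.
good = error_probability(0.5, 0.1)
bad = error_probability(0.5, 0.8)
print(map_rule(0, 0.5, 0.1), map_rule(0, 0.5, 0.8), good, bad)
```

For the equiprobable input the rule keeps the received bit when $\epsilon = 0.1$ and flips it when $\epsilon = 0.8$, and the computed error probabilities match $\min(\epsilon, 1-\epsilon)$ up to rounding.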

The astute reader will notice that having introduced conditional pmf’s
$p_{Y|X}$, the example considered the alternative pmf $p_{X|Y}$. The two are easily
related by Bayes’ rule (3.49).

A generalization of the simple binary detection problem provides the
typical form of a statistical classification system. Suppose that Nature
selects a “class” $H$, a random variable described by a pmf $p_H(h)$, which is no
longer assumed to be binary. Once the class is selected, Nature then
generates a random “observation” $X$ according to a pmf $p_{X|H}$. For example, the
class might be a medical condition and the observations the results of blood
pressure measurements, the patient’s age, medical history, and other
information regarding the patient’s health. Alternatively, the class might be an
“input signal” put into a noisy channel which has the observation $X$ as an
“output signal.” The question is: Given the observation $X = x$, what is the
best guess $\hat{H}(x)$ of the unseen class? If by “best” we adopt the criterion that
the best guess is the one that minimizes the error probability
$P_e = \Pr(\hat{H}(X) \ne H)$, then the optimal classifier is again the MAP rule
$\arg\max_u p_{H|X}(u|x)$. More generally we might assign a cost $C_{y,h}$ resulting if
the true class is $h$ and we guess $y$. Typically it is assumed that $C_{h,h} = 0$,
that is, the cost is zero if our guess is correct. (In fact it can be shown that
this assumption involves no real loss of generality.) Given a classifier
(classification rule, decision rule) $\hat{h}(x)$, the Bayes risk is then defined as

$$B(\hat{h}) = \sum_{x,h} C_{\hat{h}(x),h}\, p_{H,X}(h,x), \tag{3.69}$$

which reduces to the probability of error if the cost function is given by

$$C_{y,h} = 1 - \delta_{y,h}. \tag{3.70}$$

The optimal classifier in the sense of minimizing the Bayes risk is then
found by observing the inequality

$$B(\hat{h}) = \sum_x p_X(x) \sum_h C_{\hat{h}(x),h}\, p_{H|X}(h|x) \ge \sum_x p_X(x) \min_y \left(\sum_h C_{y,h}\, p_{H|X}(h|x)\right),$$

a lower bound that is achieved by the classifier

$$\hat{h}(x) = \arg\min_y \left(\sum_h C_{y,h}\, p_{H|X}(h|x)\right), \tag{3.71}$$

the minimum average Bayes risk classifier. This reduces to the MAP
detection rule when $C_{y,h} = 1 - \delta_{y,h}$.
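The minimum Bayes risk classifier of (3.71) is easy to sketch for finite alphabets. The code below is an illustration only: the three-class prior, the likelihood table, and the cost matrices are made-up values, and the function names are hypothetical. It computes the posterior pmf by Bayes' rule and then picks the guess with the smallest conditional expected cost.

```python
def posterior(prior, likelihood, x):
    """p_{H|X}(h|x) from prior[h] = p_H(h) and likelihood[h][x] = p_{X|H}(x|h)."""
    joint = [prior[h] * likelihood[h][x] for h in range(len(prior))]
    p_x = sum(joint)
    return [j / p_x for j in joint]

def bayes_classifier(prior, likelihood, cost, x):
    """Return argmin over y of sum_h cost[y][h] * p_{H|X}(h|x), as in (3.71)."""
    post = posterior(prior, likelihood, x)
    risks = [sum(cost[y][h] * post[h] for h in range(len(prior)))
             for y in range(len(cost))]
    return min(range(len(risks)), key=risks.__getitem__)

def map_classifier(prior, likelihood, x):
    """Return argmax over h of p_{H|X}(h|x)."""
    post = posterior(prior, likelihood, x)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical three-class problem with a binary observation.
prior = [0.5, 0.3, 0.2]
likelihood = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]

# Cost C_{y,h} = 1 - delta_{y,h}: every error costs the same.
zero_one = [[0 if y == h else 1 for h in range(3)] for y in range(3)]

# A skewed cost: failing to detect class 2 is ten times as expensive.
skewed = [[0, 1, 10], [1, 0, 10], [1, 1, 0]]

for x in (0, 1):
    print(x,
          map_classifier(prior, likelihood, x),
          bayes_classifier(prior, likelihood, zero_one, x),
          bayes_classifier(prior, likelihood, skewed, x))
```

With the zero-one cost the Bayes classifier agrees with the MAP rule, while the skewed cost changes the decision for the observation $x = 1$ in this made-up example.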



3.9 Additive Noise 

The next example of the use of conditional distributions treats the
distributions arising when one random variable (thought of as a “noise” term) is
added to another, independent random variable (thought of as a “signal”
term). This is an important example of a derived distribution problem that
yields an interesting conditional probability. The problem also suggests a
valuable new tool which will provide a simpler way of solving many similar
derived distribution problems: the characteristic function of random variables.



Discrete Additive Noise 

Consider two independent random variables $X$ and $W$ and form a new
random variable $Y = X + W$. For example, this could be a description of
how errors are actually caused in a noisy communication channel connecting
a binary information source to a user. In order to apply the detection
and classification signal processing methods, we must first compute the
appropriate conditional probabilities of the output $Y$ given the input $X$.
To do this we begin by computing the joint pmf of $X$ and $Y$ using the
inverse image formula:

$$\begin{aligned}
p_{X,Y}(x,y) &= \Pr(X = x, Y = y) \\
&= \Pr(X = x, X + W = y) \\
&= \sum_{\alpha,\beta:\,\alpha = x,\,\alpha+\beta = y} p_{X,W}(\alpha,\beta) \\
&= p_{X,W}(x, y - x) \\
&= p_X(x)\,p_W(y - x).
\end{aligned} \tag{3.72}$$

Note that this formula only makes sense if $y - x$ is one of the values in the
range space of $W$. Thus from the definition of conditional pmf’s:

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y - x), \tag{3.73}$$

an answer that should be intuitive: given the input is $x$, the output will
equal a certain value $y$ if and only if the noise exactly makes up the
difference, i.e., $W = y - x$. Note that the marginal pmf for the output $Y$ can be
found by summing the joint probability:

$$p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\,p_W(y - x), \tag{3.74}$$

a formula that is known as a discrete convolution or convolution sum.

Anyone familiar with convolutions knows that they can be unpleasant to
evaluate, so we postpone further consideration to the next section and turn
to the continuous analog.
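For small alphabets the convolution sum (3.74) can be evaluated directly by machine. The sketch below is illustrative: it represents each pmf as a dictionary from values to probabilities (a representation chosen here, not one used in the text) and uses exact rational arithmetic. The example pmf, a fair six-sided die, is also an assumption made only for the demonstration.

```python
from fractions import Fraction

def convolve_pmfs(p_x, p_w):
    """pmf of Y = X + W for independent X and W.

    Looping over all pairs (x, w) and accumulating at y = x + w is the same
    as evaluating sum_x p_X(x) p_W(y - x) for every y.
    """
    p_y = {}
    for x, px in p_x.items():
        for w, pw in p_w.items():
            p_y[x + w] = p_y.get(x + w, 0) + px * pw
    return p_y

# Hypothetical example: the sum of two independent fair dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)

for y in sorted(two_dice):
    print(y, two_dice[y])
```

The resulting pmf is the familiar triangular pmf on $\{2, \ldots, 12\}$, and its values sum to one.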

The above development assumed ordinary arithmetic, but it is worth
pointing out that for discrete random variables sometimes other types of
arithmetic are appropriate, e.g., modulo 2 arithmetic for binary random
variables. The binary example of section 3.8 can be considered as an
additive noise example if we define a random variable $W$ which is independent
of $X$ and has a pmf $p_W(w) = \epsilon^w(1-\epsilon)^{1-w}$; $w = 0, 1$ and where $Y = X + W$
is interpreted as modulo 2 arithmetic, that is, as $Y = X \oplus W$. This additive
noise definition is easily seen to yield the conditional pmf of (3.64) and the
output pmf via a convolution. To be precise,

$$\begin{aligned}
p_{X,Y}(x,y) &= \Pr(X = x, Y = y) \\
&= \Pr(X = x, X \oplus W = y) \\
&= \sum_{\alpha,\beta:\,\alpha = x,\,\alpha\oplus\beta = y} p_{X,W}(\alpha,\beta) \\
&= p_{X,W}(x, y \oplus x) \\
&= p_X(x)\,p_W(y \oplus x)
\end{aligned} \tag{3.75}$$

and hence

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = p_W(y \oplus x) \tag{3.76}$$

and

$$p_Y(y) = \sum_x p_{X,Y}(x,y) = \sum_x p_X(x)\,p_W(y \oplus x), \tag{3.77}$$

a modulo 2 convolution.



Continuous Additive Noise

An entirely analogous formula arises in the continuous case. Again suppose
that $X$ is a random variable, a signal, with pdf $f_X$, and that $W$ is a random
variable, the noise, with pdf $f_W$. The random variables $X$ and $W$ are
assumed to be independent. Form a new random variable $Y = X + W$, an
observed signal plus noise. The problem is to find the conditional pdf’s
$f_{Y|X}(y|x)$ and $f_{X|Y}(x|y)$. The operation of producing an output $Y$ from an
input signal $X$ is called an additive noise channel in communications systems.
The channel is completely described by $f_{Y|X}$. The second pdf, $f_{X|Y}$, will
prove useful later when we try to estimate $X$ given an observed value of $Y$.

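Independence of $X$ and $W$ implies that the joint pdf is $f_{X,W}(x,w) =
f_X(x)f_W(w)$. To find the needed joint pdf $f_{X,Y}$, first evaluate the joint
cdf and then take the appropriate derivative. The cdf is a straightforward
derived distribution problem:

$$\begin{aligned}
F_{X,Y}(x,y) &= \Pr(X \le x, Y \le y) \\
&= \Pr(X \le x, X + W \le y) \\
&= \iint_{\alpha,\beta:\,\alpha \le x,\,\alpha+\beta \le y} f_{X,W}(\alpha,\beta)\,d\alpha\,d\beta \\
&= \int_{-\infty}^{x} d\alpha \int_{-\infty}^{y-\alpha} d\beta\, f_X(\alpha) f_W(\beta) \\
&= \int_{-\infty}^{x} d\alpha\, f_X(\alpha) F_W(y - \alpha).
\end{aligned}$$

Taking the derivatives yields

$$f_{X,Y}(x,y) = f_X(x) f_W(y - x)$$

and hence

$$f_{Y|X}(y|x) = f_W(y - x). \tag{3.78}$$

The marginal pdf for the sum $Y = X + W$ is then found as

$$f_Y(y) = \int f_{X,Y}(x,y)\,dx = \int f_X(x) f_W(y - x)\,dx, \tag{3.79}$$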

a convolution integral of the pdf’s $f_X$ and $f_W$, analogous to the
convolution sum found when adding independent discrete random
variables. Thus the evaluation of the pdf of the sum of two independent
continuous random variables is the same as the evaluation of the output
of a linear system with an input signal $f_X$ and an impulse response $f_W$.
We will later see an easy way to accomplish this using transforms. The pdf
$f_{X|Y}$ follows from Bayes’ rule:

$$f_{X|Y}(x|y) = \frac{f_X(x) f_W(y - x)}{\int f_X(\alpha) f_W(y - \alpha)\,d\alpha}. \tag{3.80}$$



It is instructive to work through the details of the previous example for
the special case of Gaussian random variables. For simplicity the means
are assumed to be zero and hence it is assumed that $f_X$ is $\mathcal{N}(0, \sigma_X^2)$, that
$f_W$ is $\mathcal{N}(0, \sigma_W^2)$, and that as in the example $X$ and $W$ are independent and
$Y = X + W$. From (3.78)

$$f_{Y|X}(y|x) = f_W(y - x) = \frac{e^{-\frac{(y-x)^2}{2\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}}, \tag{3.81}$$

from which the conditional pdf can be immediately recognized as being
Gaussian with mean $x$ and variance $\sigma_W^2$, that is, as $\mathcal{N}(x, \sigma_W^2)$.

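To evaluate the pdf $f_{X|Y}$ using Bayes’ rule, we begin with the
denominator $f_Y$ of (3.54) and write

$$\begin{aligned}
f_Y(y) &= \int_{-\infty}^{\infty} f_{Y|X}(y|\alpha) f_X(\alpha)\,d\alpha \\
&= \int_{-\infty}^{\infty} \frac{e^{-\frac{(y-\alpha)^2}{2\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}}\, \frac{e^{-\frac{\alpha^2}{2\sigma_X^2}}}{\sqrt{2\pi\sigma_X^2}}\,d\alpha \\
&= \frac{1}{2\pi\sigma_X\sigma_W} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left[\frac{y^2 - 2\alpha y + \alpha^2}{\sigma_W^2} + \frac{\alpha^2}{\sigma_X^2}\right]}\,d\alpha
\end{aligned} \tag{3.82}$$

$$\phantom{f_Y(y)} = \frac{e^{-\frac{y^2}{2\sigma_W^2}}}{2\pi\sigma_X\sigma_W} \left[\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]}\,d\alpha\right]. \tag{3.83}$$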



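This convolution of two Gaussian “signals” can be accomplished using an
old trick called “completing the square.” Call the integral in the square
brackets at the end of the above equation $I$ and note that the integrand
resembles

$$e^{-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2},$$

which we know from (B.15) in appendix B integrates to

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2}\,d\alpha = \sqrt{2\pi\sigma^2} \tag{3.84}$$

since a Gaussian pdf integrates to 1. The trick is to modify $I$ to resemble
this integral with an additional factor. Compare the two exponents:

$$-\frac{1}{2}\left[\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2}\right]$$

vs.

$$-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2 = -\frac{1}{2}\left[\frac{\alpha^2}{\sigma^2} - \frac{2\alpha m}{\sigma^2} + \frac{m^2}{\sigma^2}\right].$$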

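The exponent from $I$ will equal the left two terms of the expanded exponent
in the known integral if we choose

$$\frac{1}{\sigma^2} = \frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}$$

or, equivalently,

$$\sigma^2 = \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2 + \sigma_W^2}, \tag{3.85}$$

and if we choose

$$\frac{y}{\sigma_W^2} = \frac{m}{\sigma^2}$$

or, equivalently,

$$m = \frac{\sigma^2}{\sigma_W^2}\, y. \tag{3.86}$$

Using (3.85) - (3.86) we have that

$$\alpha^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_W^2}\right) - \frac{2\alpha y}{\sigma_W^2} = \left(\frac{\alpha - m}{\sigma}\right)^2 - \frac{m^2}{\sigma^2},$$

where adding and subtracting the term $m^2/\sigma^2$ is what is called “completing
the square.” With this identification and again using (3.85) - (3.86) we have that

$$I = e^{\frac{m^2}{2\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\alpha - m}{\sigma}\right)^2}\,d\alpha = \sqrt{2\pi\sigma^2}\; e^{\frac{m^2}{2\sigma^2}}, \tag{3.87}$$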



which implies that

$$\begin{aligned}
f_Y(y) &= \frac{e^{-\frac{y^2}{2\sigma_W^2}}}{2\pi\sigma_X\sigma_W}\, \sqrt{2\pi\sigma^2}\; e^{\frac{m^2}{2\sigma^2}} \\
&= \frac{1}{\sqrt{2\pi(\sigma_X^2 + \sigma_W^2)}}\, e^{-\frac{y^2}{2(\sigma_X^2 + \sigma_W^2)}}.
\end{aligned} \tag{3.88}$$

In other words, $f_Y$ is $\mathcal{N}(0, \sigma_X^2 + \sigma_W^2)$ and we have shown that the sum of
two zero mean independent Gaussian random variables is another zero mean
Gaussian random variable with variance equal to the sum of the variances
of the two random variables being added.
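The conclusion that the variances add can be sanity-checked by evaluating the convolution integral numerically. The sketch below is an illustration under stated assumptions: it uses a plain midpoint rule over a truncated range, and the variances, the evaluation points, the integration width, and the function names are all arbitrary choices made for the example.

```python
import math

def gauss_pdf(x, var):
    """Zero-mean Gaussian pdf with variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def convolve_at(y, var_x, var_w, half_width=40.0, steps=20000):
    """Midpoint-rule approximation of the convolution integral at the point y."""
    d = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        a = -half_width + (i + 0.5) * d
        total += gauss_pdf(a, var_x) * gauss_pdf(y - a, var_w)
    return total * d

# Hypothetical variances for the signal and the noise.
var_x, var_w = 4.0, 1.5
points = (0.0, 1.0, -2.5)
numeric = [convolve_at(y, var_x, var_w) for y in points]
closed_form = [gauss_pdf(y, var_x + var_w) for y in points]

for y, n, c in zip(points, numeric, closed_form):
    print(y, n, c)
```

At each test point the numerically convolved value should agree closely with the $\mathcal{N}(0, \sigma_X^2 + \sigma_W^2)$ pdf.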

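Finally we turn to the a posteriori probability $f_{X|Y}$. From Bayes’ rule
and a lot of algebra

$$\begin{aligned}
f_{X|Y}(x|y) &= \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)} \\
&= \frac{\dfrac{e^{-\frac{(y-x)^2}{2\sigma_W^2}}}{\sqrt{2\pi\sigma_W^2}}\; \dfrac{e^{-\frac{x^2}{2\sigma_X^2}}}{\sqrt{2\pi\sigma_X^2}}}{\dfrac{e^{-\frac{y^2}{2(\sigma_X^2+\sigma_W^2)}}}{\sqrt{2\pi(\sigma_X^2+\sigma_W^2)}}} \\
&= \frac{1}{\sqrt{2\pi\frac{\sigma_X^2\sigma_W^2}{\sigma_X^2+\sigma_W^2}}} \exp\left\{-\frac{1}{2}\left[\frac{y^2 - 2yx + x^2}{\sigma_W^2} + \frac{x^2}{\sigma_X^2} - \frac{y^2}{\sigma_X^2+\sigma_W^2}\right]\right\} \\
&= \frac{1}{\sqrt{2\pi\frac{\sigma_X^2\sigma_W^2}{\sigma_X^2+\sigma_W^2}}} \exp\left\{-\frac{1}{2}\,\frac{\sigma_X^2+\sigma_W^2}{\sigma_X^2\sigma_W^2}\left(x - \frac{\sigma_X^2}{\sigma_X^2+\sigma_W^2}\,y\right)^2\right\}.
\end{aligned} \tag{3.89}$$

In words: $f_{X|Y}(x|y)$ is a Gaussian pdf

$$\mathcal{N}\left(\frac{\sigma_X^2}{\sigma_X^2+\sigma_W^2}\,y,\; \frac{\sigma_X^2\sigma_W^2}{\sigma_X^2+\sigma_W^2}\right).$$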



The mean of a conditional distribution is called a conditional mean and the 
variance of a conditional distribution is called a conditional variance. 
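The conditional mean and conditional variance of the Gaussian signal given the noisy observation can be checked against a direct application of Bayes' rule. The following sketch is illustrative only; the function names and the numerical values are assumptions made for the example.

```python
import math

def gauss_pdf(x, mean, var):
    """Scalar Gaussian pdf with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_params(y, var_x, var_w):
    """Conditional mean and variance of X given Y = y for Y = X + W,
    with X ~ N(0, var_x) and W ~ N(0, var_w) independent."""
    gain = var_x / (var_x + var_w)
    return gain * y, var_x * var_w / (var_x + var_w)

# Hypothetical variances, observation y, and evaluation point x.
var_x, var_w, y, x = 2.0, 0.5, 1.2, 0.7

# Bayes' rule: f_{Y|X}(y|x) f_X(x) / f_Y(y), using f_Y = N(0, var_x + var_w).
bayes = (gauss_pdf(y, x, var_w) * gauss_pdf(x, 0.0, var_x)
         / gauss_pdf(y, 0.0, var_x + var_w))

# Closed form: a Gaussian pdf with the conditional mean and variance.
cond_mean, cond_var = posterior_params(y, var_x, var_w)
closed_form = gauss_pdf(x, cond_mean, cond_var)

print(cond_mean, cond_var, bayes, closed_form)
```

Note that the conditional variance does not depend on the observed value $y$, while the conditional mean scales $y$ by the factor $\sigma_X^2/(\sigma_X^2+\sigma_W^2)$.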



Continuous Additive Noise with Discrete Input 

Additive noise provides a situation in which mixed distributions having
both discrete and continuous parts naturally arise. Suppose that the signal
$X$ is binary, say with pmf $p_X(x) = p^x(1-p)^{1-x}$. The noise term $W$
is assumed to be a continuous random variable described by pdf $f_W(w)$,
independent of $X$, with variance $\sigma_W^2$. The observation is defined by $Y =
X + W$. In this case the joint distribution is not defined by a joint pmf
or a joint pdf, but by a combination of the two. Some thought may lead
to the reasonable guess that the continuous observation given the discrete
signal should be describable by a conditional pdf $f_{Y|X}(y|x) = f_W(y - x)$,
where now the conditional pdf is of the elementary variety: the given event
has nonzero probability. To prove that this is in fact correct, consider the
elementary conditional probability $\Pr(Y \le y|X = x)$, for $x = 0, 1$. This is
recognizable as the conditional cdf for $Y$ given $X = x$, so that the desired
conditional density is given by

$$f_{Y|X}(y|x) = \frac{d}{dy}\Pr(Y \le y|X = x). \tag{3.90}$$

The required probability is evaluated using the independence of $X$ and $W$
as

$$\begin{aligned}
\Pr(Y \le y|X = x) &= \Pr(X + W \le y|X = x) \\
&= \Pr(x + W \le y|X = x) \\
&= \Pr(W \le y - x) \\
&= F_W(y - x).
\end{aligned}$$

Differentiating gives

$$f_{Y|X}(y|x) = f_W(y - x). \tag{3.91}$$



The joint distribution is described in this case by a combination of a
pmf and a pdf. For example, the joint probability that $X \in F$
and $Y \in G$ is computed as

$$\begin{aligned}
\Pr(X \in F \text{ and } Y \in G) &= \sum_{x \in F} p_X(x) \int_G f_{Y|X}(y|x)\,dy \\
&= \sum_{x \in F} p_X(x) \int_G f_W(y - x)\,dy.
\end{aligned} \tag{3.92}$$

Choosing $F = \Re$ yields the output distribution

$$\Pr(Y \in G) = \sum_x p_X(x) \int_G f_{Y|X}(y|x)\,dy = \sum_x p_X(x) \int_G f_W(y - x)\,dy.$$

Choosing $G = (-\infty, y]$ provides a formula for the cdf $F_Y(y)$, which can be
differentiated to yield the output pdf

$$f_Y(y) = \sum_x p_X(x) f_{Y|X}(y|x) = \sum_x p_X(x) f_W(y - x), \tag{3.93}$$

a mixed discrete convolution involving a pmf and a pdf (and exactly the
formula one might expect in this mixed situation given the pure discrete
and continuous examples).

Continuing the parallel with the pure discrete and continuous cases,
one might expect that Bayes’ rule could be used to evaluate the conditional
distribution in the opposite direction, which since $X$ is discrete should be
a conditional pmf:

$$p_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\,p_X(x)}{f_Y(y)} = \frac{f_{Y|X}(y|x)\,p_X(x)}{\sum_\alpha p_X(\alpha) f_{Y|X}(y|\alpha)}. \tag{3.94}$$



Observe that unlike previously treated conditional pmf’s, this one is not
an elementary conditional probability since the conditioning event does not
have nonzero probability. Thus it cannot be defined in the original manner,
but must be justified in the same way as conditional pdf’s, that is, by the
fact that we can rewrite the joint distribution (3.92) as

$$\Pr(X \in F \text{ and } Y \in G) = \int_G dy\, f_Y(y) \Pr(X \in F|Y = y) = \int_G dy\, f_Y(y) \sum_{x \in F} p_{X|Y}(x|y), \tag{3.95}$$

so that $p_{X|Y}(x|y)$ indeed plays the role of a mass of conditional probability,
that is,

$$\Pr(X \in F|Y = y) = \sum_{x \in F} p_{X|Y}(x|y). \tag{3.96}$$



Applying these results to the specific case of the binary input and
Gaussian noise, the conditional pmf of the binary input given the noisy
observation is

$$p_{X|Y}(x|y) = \frac{f_W(y - x)\,p_X(x)}{f_Y(y)} = \frac{f_W(y - x)\,p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y - \alpha)}; \quad y \in \Re,\ x \in \{0, 1\}. \tag{3.97}$$



This formula now permits the analysis of a classical problem in communi- 
cations, the detection of a binary signal in Gaussian noise. 
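The conditional pmf (3.97) is simple to evaluate for a given observation. The sketch below is an illustration with hypothetical function names and made-up numerical values; it assumes zero-mean Gaussian noise with variance $\sigma_W^2$ and forms the mixed convolution (3.93) as the normalizing constant.

```python
import math

def gauss_pdf(w, var):
    """Zero-mean Gaussian pdf with variance var, playing the role of f_W."""
    return math.exp(-w * w / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_pmf(y, p_one, var_w):
    """Return {x: p_{X|Y}(x|y)} for a binary input X with p_X(1) = p_one."""
    prior = {0: 1.0 - p_one, 1: p_one}
    weights = {x: gauss_pdf(y - x, var_w) * prior[x] for x in (0, 1)}
    f_y = sum(weights.values())  # output pdf f_Y(y), the mixed convolution
    return {x: w / f_y for x, w in weights.items()}

# Hypothetical observations for an equiprobable input.
midpoint = posterior_pmf(0.5, 0.5, 1.0)
near_one = posterior_pmf(0.9, 0.5, 0.25)
print(midpoint, near_one)
```

An observation midway between the two input values leaves an equiprobable input equally likely to have been 0 or 1, while an observation near 1 shifts the conditional pmf toward $x = 1$.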



3.10 Binary Detection in Gaussian Noise 



The derivation of the MAP detector or classifier extends immediately to the
situation of a binary input random variable and independent Gaussian
noise just treated. As in the purely discrete case, the MAP detector $\hat{X}(y)$
of $X$ given $Y = y$ is given by

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x \frac{f_W(y - x)\,p_X(x)}{\sum_\alpha p_X(\alpha) f_W(y - \alpha)}. \tag{3.98}$$

Since the denominator of the conditional pmf does not depend on $x$ (only
on $y$), given $y$ the denominator has no effect on the maximization:

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x f_W(y - x)\,p_X(x).$$

Assume for simplicity that $X$ is equally likely to be 0 or 1 so that the rule
becomes

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\max_x \frac{1}{\sqrt{2\pi\sigma_W^2}}\, e^{-\frac{1}{2}\frac{(x-y)^2}{\sigma_W^2}}.$$

The constant in front of the pdf does not affect the maximization. In
addition, the exponential is a monotonically decreasing function of $|x - y|$, so
that the exponential is maximized by minimizing this magnitude difference,
i.e.,

$$\hat{X}(y) = \arg\max_x p_{X|Y}(x|y) = \arg\min_x |x - y|, \tag{3.99}$$

which yields a final simple rule: choose as the best guess of $X$ whichever of
$x = 0$ or $x = 1$ is closer to $y$. This choice yields the MAP detection and hence
the minimum probability of error. In our example this yields the rule

$$\hat{X}(y) = \begin{cases} 0 & y < 0.5 \\ 1 & y > 0.5. \end{cases} \tag{3.100}$$



Because the optimal detector chooses the $x$ that minimizes the Euclidean
distance $|x - y|$ to the observation $y$, it is called a minimum distance detector
or rule. Because the guess can be computed by comparing the observation
to a threshold (the value midway between the two possible values of $x$), the
detector is also called a threshold detector.

Assumptions have been made to keep things fairly simple. The reader
is invited to work out what happens if the random variable $X$ is biased and
if its alphabet is taken to be $\{-1, 1\}$ instead of $\{0, 1\}$. It is instructive to
sketch the conditional pmf’s for these cases.

Having derived the optimal detector, it is reasonable to look at the
resulting, minimized, probability of error. This can be found using
conditional probability:

$$\begin{aligned}
P_e &= \Pr(\hat{X}(Y) \ne X) \\
&= \Pr(\hat{X}(Y) \ne 0|X = 0)\,p_X(0) + \Pr(\hat{X}(Y) \ne 1|X = 1)\,p_X(1) \\
&= \Pr(Y > 0.5|X = 0)\,p_X(0) + \Pr(Y < 0.5|X = 1)\,p_X(1) \\
&= \Pr(W + X > 0.5|X = 0)\,p_X(0) + \Pr(W + X < 0.5|X = 1)\,p_X(1) \\
&= \Pr(W > 0.5|X = 0)\,p_X(0) + \Pr(W + 1 < 0.5|X = 1)\,p_X(1) \\
&= \Pr(W > 0.5)\,p_X(0) + \Pr(W < -0.5)\,p_X(1),
\end{aligned}$$

where we have used the independence of $W$ and $X$. These probabilities can
be stated in terms of the $\Phi$ function of (2.78) as in (2.82), which combined
with the assumption that $X$ is uniform and (2.84) yields

$$P_e = \frac{1}{2}\left(1 - \Phi\left(\frac{1}{2\sigma_W}\right) + \Phi\left(-\frac{1}{2\sigma_W}\right)\right) = \Phi\left(-\frac{1}{2\sigma_W}\right). \tag{3.101}$$
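The error probability (3.101) can be evaluated with nothing more than the error function. The sketch below is an illustration; it assumes that the $\Phi$ function of (2.78) is the standard Gaussian cdf, writes it in terms of `math.erf`, and uses hypothetical function names and a made-up noise standard deviation.

```python
import math

def Phi(x):
    """Standard Gaussian cdf (assumed here to be the Phi function of the text)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def threshold_detector(y):
    """Minimum distance rule for inputs in {0, 1}: compare y with 0.5."""
    return 0 if y < 0.5 else 1

def error_probability(sigma_w):
    """P_e for equiprobable inputs, computed from the two conditional error terms."""
    p_err_given_0 = 1.0 - Phi(0.5 / sigma_w)   # Pr(W > 0.5)
    p_err_given_1 = Phi(-0.5 / sigma_w)        # Pr(W < -0.5)
    return 0.5 * (p_err_given_0 + p_err_given_1)

sigma_w = 0.5  # hypothetical noise standard deviation
p_e = error_probability(sigma_w)
print(p_e, Phi(-1.0 / (2.0 * sigma_w)))
```

The two printed values agree, reflecting the symmetry $1 - \Phi(x) = \Phi(-x)$ used in the last step of (3.101).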



3.11 Statistical Estimation 

Discrete conditional probabilities were seen to provide a method for guessing
an unknown class from an observation: if all incorrect choices have equal
costs so that the overall optimality criterion is to minimize the probability of
error, then the optimal classification rule is to guess that the class $X = k$,
where $p_{X|Y}(k|y) = \max_x p_{X|Y}(x|y)$, the maximum a posteriori or MAP
decision rule. There is an analogous problem and solution in the continuous
case, but the result does not have as strong an interpretation as in the
discrete case. A more complete analogy will be derived in the next chapter.

As in the discrete case, suppose that a random variable $Y$ is observed
and the goal is to make a good guess $\hat{X}(Y)$ of another random variable $X$
that is jointly distributed with $Y$. Unfortunately in the continuous case it
does not make sense to measure the quality of such a guess by the
probability of its being correct because now that probability is usually zero. For
example, if an observation $Y = X + W$ is formed by adding a Gaussian signal
$X$ to an independent Gaussian noise $W$ as in the previous
section, then no rule is going to recover $X$ perfectly from $Y$. Nonetheless,
intuitively there should be reasonable ways to make such guesses in
continuous situations. Since $X$ is continuous, such guesses are referred to as
“estimation” or “prediction” of $X$ rather than as “classification” or
“detection” as used in the discrete case. In the statistical literature the general
problem is referred to as “regression.”

One approach is to mimic the discrete approach on intuitive grounds. If
the best guess in the classification problem of a random variable $X$ given an
observation $Y$ is the MAP classifier $\hat{X}_{\rm MAP}(y) = \arg\max_x p_{X|Y}(x|y)$, then
a natural analog in the continuous case is the so-called MAP estimator
defined by

$$\hat{X}_{\rm MAP}(y) = \arg\max_x f_{X|Y}(x|y), \tag{3.102}$$

the value of $x$ maximizing the conditional pdf given $y$. The advantage
of this estimator is that it is easy to describe and provides an immediate
application of conditional pdf’s paralleling that of classification for discrete
conditional probability. The disadvantage is that we cannot argue that this
estimate is “optimal” in the sense of optimizing some specified criterion; it
is essentially an ad hoc (but reasonable) rule. As an example of its use,
consider the Gaussian signal plus noise of the previous section. There it was
found that the pdf $f_{X|Y}(x|y)$ is Gaussian with mean $\frac{\sigma_X^2}{\sigma_X^2+\sigma_W^2}\,y$. Since the
Gaussian density has its peak at its mean, in this case the MAP estimate
of $X$ given $Y = y$ is given by the conditional mean $\frac{\sigma_X^2}{\sigma_X^2+\sigma_W^2}\,y$.

Knowledge of the conditional pdf is all that is needed to define another
estimator: the maximum likelihood or ML estimate of $X$ given $Y = y$ is
defined as the value of $x$ that maximizes the conditional pdf $f_{Y|X}(y|x)$,
the pdf with the roles of input and output reversed from that of the MAP
estimator. Thus

$$\hat{X}_{\rm ML}(y) = \arg\max_x f_{Y|X}(y|x). \tag{3.103}$$

Thus in the Gaussian case treated above, $\hat{X}_{\rm ML}(y) = y$.

The main interest in the ML estimator in some applications is that it is
sometimes simpler and that it does not require any assumption on the input
statistics. The MAP estimator depends strongly on $f_X$; the ML estimator
does not depend on it at all. It is easy to see that if the input pdf is uniform,
the MAP estimator and the ML estimator are the same.
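The difference between the two estimators can be seen numerically for the Gaussian signal plus noise example. The sketch below is illustrative and rests on stated assumptions: it replaces the exact maximization by a brute-force search over a finite grid, and the variances, the observation, the grid limits, and the function names are made-up choices. Since $f_Y(y)$ does not depend on $x$, the MAP search maximizes $f_{Y|X}(y|x) f_X(x)$.

```python
import math

def gauss_pdf(x, mean, var):
    """Scalar Gaussian pdf with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def argmax_on_grid(score, lo=-10.0, hi=10.0, steps=20001):
    """Grid point in [lo, hi] with the largest score (a crude stand-in for argmax)."""
    best_x, best = lo, score(lo)
    for i in range(1, steps):
        x = lo + (hi - lo) * i / (steps - 1)
        s = score(x)
        if s > best:
            best_x, best = x, s
    return best_x

# Hypothetical signal variance, noise variance, and observation.
var_x, var_w, y = 4.0, 1.0, 2.5

# MAP: maximize f_{Y|X}(y|x) f_X(x), which is proportional to f_{X|Y}(x|y).
x_map = argmax_on_grid(lambda x: gauss_pdf(y, x, var_w) * gauss_pdf(x, 0.0, var_x))

# ML: maximize f_{Y|X}(y|x) alone; the input pdf plays no role.
x_ml = argmax_on_grid(lambda x: gauss_pdf(y, x, var_w))

print(x_map, var_x / (var_x + var_w) * y, x_ml)
```

Up to the grid resolution, the MAP estimate matches the conditional mean and the ML estimate matches the observation itself, so the prior pulls the MAP estimate toward the signal mean of zero.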



3.12 Characteristic Functions 

We have seen that summing two independent random variables produces a new
random variable whose pmf or pdf is found by convolving the two pmf’s or pdf’s
of the original random variables. Anyone with an engineering background
will likely have had experience with convolutions and recall that they can be
somewhat messy to evaluate. To make matters worse, if one wishes to sum
additional independent random variables to the existing sum, say form
$Y = \sum_{i=1}^{N} X_i$ from an iid collection $\{X_i\}$, then the result will be an
$N$-fold convolution, a potential nightmare in all but the simplest of cases. As
in other engineering applications such as circuit design, convolutions can be
avoided by Fourier transform methods and in this section we describe
the method as an alternative approach for the examples to come. We begin
with the discrete case.

Historically the transforms used in probability theory have been slightly
different from those in traditional Fourier analysis. For a discrete random
variable with pmf $p_X$, define the characteristic function $M_X$ of the random
variable (or of the pmf) as

$$M_X(ju) = \sum_x p_X(x)\,e^{jux}, \tag{3.104}$$

where $u$ is usually assumed to be real. Recalling the definition (2.34)
of the expectation of a function $g$ defined on a sample space, choosing
$g(\omega) = e^{ju\omega}$ shows that the characteristic function can be more
simply defined as

$$M_X(ju) = E[e^{juX}]. \tag{3.105}$$

Thus characteristic functions, like probabilities, can be viewed as special
cases of expectations.

This transform, which is also referred to as an exponential transform or
operational transform, bears a strong resemblance to the discrete-parameter
Fourier transform

$$\mathcal{F}_\nu(p_X) = \sum_x p_X(x)\,e^{-j2\pi\nu x} \tag{3.106}$$

and the z-transform

$$\mathcal{Z}_z(p_X) = \sum_x p_X(x)\,z^x. \tag{3.107}$$

In particular, $M_X(ju) = \mathcal{F}_{-u/2\pi}(p_X) = \mathcal{Z}_{e^{ju}}(p_X)$. As a result, all of the
properties of characteristic functions follow immediately from (are
equivalent to) similar properties from Fourier or z transforms. As with Fourier
and z transforms, the original pmf $p_X$ can be recovered from the transform
$M_X$ by suitable inversion. For example, given a pmf $p_X(k)$; $k \in \mathcal{Z}_N$,




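$$\begin{aligned}
\frac{1}{2\pi}\int_{-\pi}^{\pi} M_X(ju)\,e^{-juk}\,du &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \left(\sum_x p_X(x)\,e^{jux}\right) e^{-juk}\,du \\
&= \sum_x p_X(x)\,\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ju(x-k)}\,du \\
&= \sum_x p_X(x)\,\delta_{k-x} = p_X(k).
\end{aligned} \tag{3.108}$$

Consider again the problem of summing two independent random
variables $X$ and $W$ with pmf’s $p_X$ and $p_W$ with characteristic functions $M_X$
and $M_W$, respectively. If $Y = X + W$ as before we can evaluate the
characteristic function of $Y$ as

$$M_Y(ju) = \sum_y p_Y(y)\,e^{juy}$$

where from the inverse image formula

$$p_Y(y) = \sum_{x,w:\,x+w=y} p_{X,W}(x,w)$$

so that

$$\begin{aligned}
M_Y(ju) &= \sum_y \left(\sum_{x,w:\,x+w=y} p_{X,W}(x,w)\right) e^{juy} \\
&= \sum_y \left(\sum_{x,w:\,x+w=y} p_{X,W}(x,w)\,e^{juy}\right) \\
&= \sum_y \left(\sum_{x,w:\,x+w=y} p_{X,W}(x,w)\,e^{ju(x+w)}\right) \\
&= \sum_{x,w} p_{X,W}(x,w)\,e^{ju(x+w)},
\end{aligned}$$

where the last equality follows because each of the sums for distinct $y$
collects together different $x$ and $w$ and together the sums for all $y$ gather
all of the $x$ and $w$. This last sum factors, however, as

$$\begin{aligned}
M_Y(ju) &= \sum_{x,w} p_X(x)\,p_W(w)\,e^{jux}e^{juw} \\
&= \left(\sum_x p_X(x)\,e^{jux}\right)\left(\sum_w p_W(w)\,e^{juw}\right) \\
&= M_X(ju)\,M_W(ju),
\end{aligned} \tag{3.109}$$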

which shows that the transform of the pmf of the sum of independent 
random variables is simply the product of the transforms. 

Iterating (3.109) several times gives an extremely useful result that we
state formally as a theorem. It can be proved by repeating the above
argument, but we shall later see a shorter proof.

Theorem 3.1 If $\{X_i; i = 1, \ldots, N\}$ are independent random variables
with characteristic functions $M_{X_i}$, then the characteristic function of the
random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.110}$$

If the $X_i$ are independent and identically distributed with common
characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju). \tag{3.111}$$

As a simple example, the characteristic function of a binary random
variable $X$ with parameter $p = p_X(1) = 1 - p_X(0)$ is easily found to be

$$M_X(ju) = \sum_{k=0}^{1} e^{juk}\,p_X(k) = (1-p) + p\,e^{ju}. \tag{3.112}$$

If $\{X_i; i = 1, \ldots, n\}$ are independent Bernoulli random variables with
identical distributions and $Y_n = \sum_{i=1}^{n} X_i$, then $M_{Y_n}(ju) = [(1-p) + p\,e^{ju}]^n$
and hence

$$\begin{aligned}
M_{Y_n}(ju) &= \sum_{k=0}^{n} p_{Y_n}(k)\,e^{juk} \\
&= \left((1-p) + p\,e^{ju}\right)^n \\
&= \sum_{k=0}^{n} \left[\binom{n}{k}(1-p)^{n-k}p^k\right] e^{juk},
\end{aligned}$$

where we have invoked the binomial theorem in the last step. For the
equality to hold, however, we have from the uniqueness of transforms that
$p_{Y_n}(k)$ must be the bracketed term, that is, the binomial pmf

$$p_{Y_n}(k) = \binom{n}{k}(1-p)^{n-k}p^k; \quad k \in \mathcal{Z}_{n+1}. \tag{3.113}$$
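The transform route to the binomial pmf can be mimicked numerically. The sketch below is illustrative and differs from the text in one labeled respect: instead of the inversion integral it samples the characteristic function at $N = n + 1$ equally spaced frequencies and inverts with a finite sum, which recovers the pmf exactly here because the sum takes values only in $\{0, \ldots, n\}$. The function names and the values of $n$ and $p$ are assumptions for the example.

```python
import cmath
import math

def pmf_from_characteristic(char_fn, n_points):
    """Recover a pmf on {0, ..., n_points - 1} from samples of M(ju).

    Samples are taken at u = 2*pi*m/n_points; the finite inversion sum is a
    sampled analog of the inversion integral, not the integral itself.
    """
    samples = [char_fn(2.0 * math.pi * m / n_points) for m in range(n_points)]
    pmf = []
    for k in range(n_points):
        acc = sum(s * cmath.exp(-2j * math.pi * m * k / n_points)
                  for m, s in enumerate(samples))
        pmf.append((acc / n_points).real)
    return pmf

# Hypothetical parameters: n Bernoulli trials with success probability p.
n, p = 8, 0.3

def bernoulli_cf(u):
    """Characteristic function (1 - p) + p e^{ju} of one binary variable."""
    return (1.0 - p) + p * cmath.exp(1j * u)

def sum_cf(u):
    """Characteristic function of the sum of n iid binary variables."""
    return bernoulli_cf(u) ** n

pmf = pmf_from_characteristic(sum_cf, n + 1)
binomial = [math.comb(n, k) * (1 - p) ** (n - k) * p ** k for k in range(n + 1)]

for k in range(n + 1):
    print(k, pmf[k], binomial[k])
```

The inverted values agree with the binomial pmf up to rounding, without any convolution ever being computed.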

In the continuous case, as in the discrete case, convolutions can be avoided
by transforming the densities involved. The derivation is exactly analogous
to the discrete case, with integrals replacing sums in the usual way.

For a continuous random variable $X$ with pdf $f_X$, define the
characteristic function $M_X$ of the random variable (or of the pdf) as

$$M_X(ju) = \int f_X(x)\,e^{jux}\,dx. \tag{3.114}$$

As in the discrete case, this can be considered as a special case of
expectation for continuous random variables as defined in (2.34) so that

$$M_X(ju) = E[e^{juX}]. \tag{3.115}$$

The characteristic function is related to the continuous-parameter
Fourier transform

$$\mathcal{F}_\nu(f_X) = \int f_X(x)\,e^{-j2\pi\nu x}\,dx \tag{3.116}$$

and the Laplace transform

$$\mathcal{L}_s(f_X) = \int f_X(x)\,e^{-sx}\,dx \tag{3.117}$$

by $M_X(ju) = \mathcal{F}_{-u/2\pi}(f_X) = \mathcal{L}_{-ju}(f_X)$. As a result, all of the properties of
characteristic functions of densities follow immediately from (are equivalent
to) similar properties from Fourier or Laplace transforms. For example,
given a well-behaved density $f_X(x)$; $x \in \Re$ with characteristic function
$M_X(ju)$,

$$f_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M_X(ju)\,e^{-jux}\,du. \tag{3.118}$$

Consider again the problem of summing two independent random
variables $X$ and $W$ with pdf’s $f_X$ and $f_W$ with characteristic functions $M_X$ and
$M_W$, respectively, and let $Y = X + W$. As in the discrete case it can be
shown that

$$M_Y(ju) = M_X(ju)\,M_W(ju). \tag{3.119}$$

Rather than mimic the proof of the discrete case, however, we postpone the
proof to a more general treatment of characteristic functions in chapter 4.

As in the discrete case, iterating (3.119) several times yields the
following result, which now includes both discrete and continuous cases.

Theorem 3.2 If $\{X_i; i = 1, \ldots, N\}$ are independent random variables
with characteristic functions $M_{X_i}$, then the characteristic function of the
random variable $Y = \sum_{i=1}^{N} X_i$ is

$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.120}$$

If the $X_i$ are independent and identically distributed with common
characteristic function $M_X$, then

$$M_Y(ju) = M_X^N(ju). \tag{3.121}$$







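As an example of characteristic functions and continuous random
variables, consider the Gaussian random variable. The evaluation requires a
bit of effort, either using the “complete the square” technique of calculus
or by looking up in published tables. Assume that $X$ is a Gaussian random
variable with mean $m$ and variance $\sigma^2$. Then

$$\begin{aligned}
M_X(ju) &= E(e^{juX}) \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-m)^2/2\sigma^2}\, e^{jux}\,dx \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2}\,dx \\
&= \left[\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x - (m + ju\sigma^2))^2/2\sigma^2}\,dx\right] e^{jum - u^2\sigma^2/2} \\
&= e^{jum - u^2\sigma^2/2}.
\end{aligned} \tag{3.122}$$

Thus the characteristic function of a Gaussian random variable with
mean $m$ and variance $\sigma^2$ is

$$M_X(ju) = e^{jum - u^2\sigma^2/2}. \tag{3.123}$$

If $\{X_i; i = 1, \ldots, n\}$ are independent Gaussian random variables with
identical densities $\mathcal{N}(m, \sigma^2)$ and $Y_n = \sum_{i=1}^{n} X_i$, then

$$M_{Y_n}(ju) = \left[e^{jum - u^2\sigma^2/2}\right]^n = e^{ju(nm) - u^2(n\sigma^2)/2}, \tag{3.124}$$

which is the characteristic function of a Gaussian random variable with
mean $nm$ and variance $n\sigma^2$.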
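The closed form for the Gaussian characteristic function can be checked against a direct numerical evaluation of the defining integral. The sketch below is an illustration under stated assumptions: it uses a midpoint rule over a range truncated at twelve standard deviations, and the function names and the values of $m$, $\sigma^2$, $u$, and $n$ are arbitrary choices for the example.

```python
import cmath
import math

def gaussian_cf_numeric(u, m, var, half_width=12.0, steps=40000):
    """Midpoint-rule approximation of E[exp(juX)] for X ~ N(m, var)."""
    sd = math.sqrt(var)
    lo = m - half_width * sd
    d = 2.0 * half_width * sd / steps
    total = 0j
    for i in range(steps):
        x = lo + (i + 0.5) * d
        pdf = math.exp(-(x - m) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        total += pdf * cmath.exp(1j * u * x)
    return total * d

def gaussian_cf_closed(u, m, var):
    """Closed form exp(jum - u^2 var / 2)."""
    return cmath.exp(1j * u * m - u * u * var / 2.0)

# Hypothetical mean, variance, frequency, and number of summands.
m, var, u, n = 1.0, 2.0, 0.7, 5

numeric = gaussian_cf_numeric(u, m, var)
closed = gaussian_cf_closed(u, m, var)

# Characteristic function of the sum of n iid copies, computed two ways.
power = closed ** n
summed = gaussian_cf_closed(u, n * m, n * var)

print(numeric, closed)
print(power, summed)
```

The second comparison illustrates the product rule for sums: raising the single-variable transform to the $n$th power gives the transform of a Gaussian with mean $nm$ and variance $n\sigma^2$.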

The following maxim should be kept in mind whenever faced with sums 
of independent random variables: 

When given a derived distribution problem involving the 
sum of independent random variables, first find the characteris- 
tic function of the sum by taking the product of the characteris- 
tic functions of the individual random variables. Then find the 
corresponding probability function by inverting the transform. 

This technique is valid if the random variables are independent 
— they do not have to be identically distributed. 
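The maxim can be tried out numerically for integer-valued random variables. The following is a minimal sketch, not part of the original development: it assumes each pmf is supported on $0, 1, \ldots, K$, so that sampling the product of characteristic functions at finitely many frequencies is enough to invert the transform exactly. The function names `char_fn` and `pmf_of_sum` are hypothetical, chosen only for illustration.

```python
import cmath
from math import comb, pi


def char_fn(pmf, u):
    """Characteristic function M_X(ju) = sum_k p(k) e^{juk} of a pmf on 0..K."""
    return sum(p * cmath.exp(1j * u * k) for k, p in enumerate(pmf))


def pmf_of_sum(pmfs):
    """pmf of a sum of independent random variables with pmfs on 0..K_i.

    The sum takes values in 0..sum(K_i), so L = sum(K_i) + 1 equally spaced
    frequencies suffice to invert the transform exactly.
    """
    L = sum(len(p) - 1 for p in pmfs) + 1
    freqs = [2 * pi * m / L for m in range(L)]
    # Step 1: product of the individual characteristic functions.
    prod = []
    for u in freqs:
        val = 1.0 + 0j
        for p in pmfs:
            val *= char_fn(p, u)
        prod.append(val)
    # Step 2: invert, p(k) = (1/L) sum_m M(j u_m) e^{-j u_m k}.
    return [
        (sum(M * cmath.exp(-1j * u * k) for M, u in zip(prod, freqs)) / L).real
        for k in range(L)
    ]


p = 0.3
n = 5
via_transform = pmf_of_sum([[1 - p, p]] * n)   # sum of n iid Bernoulli(p)
binomial = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# The random variables need not be identically distributed:
mixed = pmf_of_sum([[0.5, 0.5], [0.2, 0.3, 0.5]])
```

Here `via_transform` should agree with the binomial pmf up to rounding, and `mixed` with the direct convolution of the two pmfs.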



3.13 Gaussian Random Vectors 

A random vector is said to be Gaussian if its density is Gaussian, that
is, if its distribution is described by the multidimensional pdf explained in
chapter 2. The component random variables of a Gaussian random vector
are said to be jointly Gaussian random variables. Note that the symmetric
matrix $\Lambda$ of the $k$-dimensional vector pdf has $k(k+1)/2$ parameters and
that the vector $m$ has $k$ parameters. On the other hand, the $k$ marginal
pdf’s together have only $2k$ parameters. Again we note the impossibility
of constructing joint pdf’s without more specification than the marginal
pdf’s alone. As previously, the marginals will suffice to describe the entire
vector if we also know that the vector has independent components, e.g.,
the vector is iid. In this case the matrix $\Lambda$ is diagonal.

Although difficult to describe, Gaussian random vectors have several 
nice properties. One of the most important of these properties is that lin- 
ear or affine operations on Gaussian random vectors produce Gaussian ran- 
dom vectors. This result can be demonstrated with only a modest amount 
of work using multidimensional characteristic functions, the extension of 
transforms from scalars to vectors. 

The multidimensional characteristic function of a distribution is defined
as follows: Given a random vector $X = (X_0, \ldots, X_{n-1})$ and a vector
parameter $u = (u_0, \ldots, u_{n-1})$, the $n$-dimensional characteristic function
$M_X(ju)$ is defined by

$$\begin{aligned}
M_X(ju) &= M_{X_0, \ldots, X_{n-1}}(ju_0, \ldots, ju_{n-1}) \\
&= E\left[ e^{ju^t X} \right] \\
&= E\left[ \exp\left( j \sum_{k=0}^{n-1} u_k X_k \right) \right].
\end{aligned} \tag{3.125}$$



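It can be shown using multivariable calculus (problem 3.49) that a Gaussian
random vector with mean vector $m$ and covariance matrix $\Lambda$ has
characteristic function

$$\begin{aligned}
M_X(ju) &= e^{ju^t m - \frac{1}{2} u^t \Lambda u} \\
&= \exp\left[ j \sum_{k=0}^{n-1} u_k m_k - \frac{1}{2} \sum_{k=0}^{n-1} \sum_{m=0}^{n-1} u_k \Lambda(k, m) u_m \right].
\end{aligned} \tag{3.126}$$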



Observe that the Gaussian characteristic function has the same form as 
the Gaussian pdf — an exponential quadratic in its argument. However, 
unlike the pdf, the characteristic function depends on the covariance matrix 
directly, whereas the pdf contains the inverse of the covariance matrix. 
Thus the Gaussian characteristic function is in some sense simpler than 
the Gaussian pdf. As a further consequence of the direct dependence on 
the covariance matrix, it is interesting to note that, unlike the Gaussian 
pdf, the characteristic function is well-defined even if $\Lambda$ is only nonnegative
definite and not strictly positive definite. Previously we gave a definition of
a Gaussian random vector in terms of its pdf. Now we can give an alternate,
more general (in the sense that a strictly positive definite covariance matrix
is not required) definition of a Gaussian random vector (and hence random
process):

A random vector is Gaussian if and only if it has a characteristic function 
of the form of (3.126). 
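As a sketch of why the characteristic function definition is convenient, consider the claim that affine operations preserve the Gaussian property. The matrix $H$ and vector $b$ below are hypothetical names introduced only for this illustration; $X$ is assumed to be Gaussian with mean $m$ and covariance $\Lambda$ as in (3.126), and $Y = HX + b$.

```latex
\begin{aligned}
M_Y(ju) &= E\left[ e^{ju^t (HX + b)} \right]
         = e^{ju^t b}\, E\left[ e^{j (H^t u)^t X} \right]
         = e^{ju^t b}\, M_X(j H^t u) \\
        &= e^{ju^t b}\, e^{j u^t H m - \frac{1}{2} u^t H \Lambda H^t u}
         = e^{j u^t (Hm + b) - \frac{1}{2} u^t (H \Lambda H^t) u}.
\end{aligned}
```

This has the form of (3.126) with mean vector $Hm + b$ and covariance matrix $H \Lambda H^t$, so $Y$ is Gaussian under the characteristic function definition. Note that $H \Lambda H^t$ is nonnegative definite but can be singular (for example, when $H$ has more rows than columns), which is exactly the case the pdf definition cannot handle.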



3.14 Examples: Simple Random Processes 

In this section several examples of random processes defined on simple 
probability spaces are given to illustrate the basic definition of an infinite 
collection of random variables defined on a single space. In the next section 
more complicated examples are considered by defining random variables on 
a probability space which is the output space for another random process, 
a setup that can be viewed as signal processing. 



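[3.22] Consider the binary probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \{0, 1\}$,
$\mathcal{F}$ the usual event space, and $P$ induced by the pmf $p(0) = \alpha$ and
$p(1) = 1 - \alpha$, where $\alpha$ is some constant, $0 < \alpha < 1$. Define a random process
on this space as follows:

$$X(t, \omega) = \cos(\omega t) = \begin{cases} \cos(t), & t \in \Re, \text{ if } \omega = 1 \\ 1, & t \in \Re, \text{ if } \omega = 0. \end{cases}$$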


Thus if a 1 occurs a cosine is sent forever, and if a 0 occurs a constant 
1 is sent forever. 



This process clearly has continuous time and at first glance it might 
appear to also have continuous amplitude, but only two waveforms are 
possible, a cosine and a constant. Thus the alphabet at each time contains 
at most two values and these possible values change with time. Hence this 
process is in fact a discrete amplitude process and random vectors drawn 
from this source are described by pmf’s. We can consider the alphabet of 
the process to be either $\Re$ or $[-1, 1]$, among other possibilities. Fix time
at $t = \pi/2$. Then $X(\pi/2)$ is a random variable with pmf

$$p_{X(\pi/2)}(x) = \begin{cases} \alpha & \text{if } x = 1 \\ 1 - \alpha & \text{if } x = 0. \end{cases}$$

The reader should try other instances of time. What happens at
$t = 0, 2\pi, 4\pi, \ldots$?
[3.23] Consider a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \Re$, $\mathcal{F} = \mathcal{B}(\Re)$, the
Borel field, and probability measure $P$ induced by the pdf

$$f(r) = \begin{cases} 1 & \text{if } r \in [0, 1] \\ 0 & \text{otherwise.} \end{cases}$$

Again define the random process $\{X(t)\}$ by $X(t, \omega) = \cos(\omega t)$; $t \in \Re$.

Again the process is continuous time, but now it has mixed alphabet
because an uncountable infinity of waveforms is possible corresponding to
all angular frequencies between 0 and 1 so that $X(t, \omega)$ is a continuous
random variable except at $t = 0$. $X(0, \omega) = 1$ is a discrete random variable.

If you calculate the pdf of the random variable $X(t)$ you see that it varies
as a function of time (problem 3.25).

[3.24] Consider the probability space of example [3.23], but cut it down to
the unit interval; that is, consider the probability space $([0, 1), \mathcal{B}([0, 1)), P)$
where $P$ is the probability measure induced by the pdf $f(r) = 1$; $r \in [0, 1)$.
(So far this is just another model for the same thing.) Define
for $n = 1, 2, \ldots$, $X_n(\omega) = b_n(\omega)$ = the $n$th digit in the binary expansion of
$\omega$, that is

$$\omega = \sum_{n=1}^{\infty} b_n 2^{-n}$$

or equivalently $\omega = .b_1 b_2 b_3 \ldots$ in binary.

$\{X_n;\ n = 1, 2, \ldots\}$ is a one-sided discrete alphabet random process with
alphabet $\{0, 1\}$. It is important to understand that nature has selected $\omega$
at the beginning of time, but the observer has no way of determining $\omega$
completely without waiting until the end of time. Nature only reveals one
bit of $\omega$ per unit time, so the observer can only get an improved estimate of
$\omega$ as time goes on. This is an excellent example of how a random process
can be modeled by selecting only a single outcome, yet the observer sees a
process that evolves forever.
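The way the observer's knowledge of $\omega$ improves one bit at a time can be sketched in a few lines. This is an illustration only: the helper name `binary_digits` is hypothetical, and exact rational arithmetic is assumed so that no digits are lost to rounding.

```python
from fractions import Fraction


def binary_digits(omega, n):
    """Return [b_1, ..., b_n], the first n binary digits of omega in [0, 1)."""
    if not 0 <= omega < 1:
        raise ValueError("omega must lie in [0, 1)")
    digits = []
    x = Fraction(omega)
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)   # next digit of the expansion
        digits.append(bit)
        x -= bit            # keep only the fractional part
    return digits


omega = Fraction(5, 8)            # .101 in binary
bits = binary_digits(omega, 6)    # what nature has revealed after 6 time units
# Partial sum of b_n 2^{-n}: the observer's estimate of omega after 6 bits.
estimate = sum(Fraction(b, 2 ** (k + 1)) for k, b in enumerate(bits))
```

After $n$ bits the estimate is never larger than $\omega$ and is within $2^{-n}$ of it, so the observer's uncertainty halves with each time unit but vanishes only in the limit.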

In this example our change in the sample space to $[0, 1)$ from $\Re$ was
done for convenience. By restricting the sample space we did not have to 
define the random variable outside of the unit interval (as we would have 
had to do to provide a complete description). 

At times it is necessary to extend the definition of a random process 
to include vector-valued functions of time so that the random process is a 
function of three arguments instead of two. The most important extension 
is to complex-valued random processes, i.e., vectors of length 2. We will 
not make such extensions frequently but we will include an example at this 
time. 
[3.25] Random Rotations

Given the same probability space as in example [3.24], define a
complex-valued random process $\{X_n\}$ as follows: Let $\alpha$ be a fixed real
parameter and define

$$X_n(\omega) = e^{jn\alpha} e^{j2\pi\omega} = e^{j(n\alpha + 2\pi\omega)}; \quad n = 1, 2, 3, \ldots$$



This process, called the random rotations process, is a discrete time
continuous (complex) alphabet one-sided random process. Note that an
alternative description of the same process would be to define $\Omega$
as the unit circle in the complex plane together with its Borel field and to
define a process $Y_n(\omega) = c^n \omega$ for some fixed $c \in \Omega$;
this representation points out that successive values of $Y_n$ are obtained by
rotating the previous value through an angle determined by $c$.

Note that the joint pdf of the complex components of $X_n$ varies with
time, $n$, as does the pdf in example [3.23] (problem 3.28).

[3.26] Again consider the probability space of example [3.24]. We define a
random process recursively on this space as follows: Define $X_0 = \omega$
and

$$X_n(\omega) = 2X_{n-1}(\omega) \bmod 1 = \begin{cases} 2X_{n-1}(\omega) & \text{if } 0 \le X_{n-1}(\omega) < 1/2 \\ 2X_{n-1}(\omega) - 1 & \text{if } 1/2 \le X_{n-1}(\omega) < 1, \end{cases}$$

where $r \bmod 1$ is the fractional portion of $r$. In other words, if
$X_{n-1}(\omega) = x$ is in $[0, 1/2)$, then $X_n(\omega) = 2x$. If $X_{n-1}(\omega) = x$ is
in $[1/2, 1)$, then $X_n(\omega) = 2x - 1$.
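The recursion is easy to iterate by machine. The sketch below is illustrative only; the name `doubling_map_path` is hypothetical, and exact fractions are used on the assumption that one wants to avoid binary floating point, where each doubling discards one bit of the starting value.

```python
from fractions import Fraction


def doubling_map_path(omega, n):
    """Return [X_0, X_1, ..., X_n] with X_0 = omega and X_k = 2 X_{k-1} mod 1."""
    path = [Fraction(omega)]
    for _ in range(n):
        x = path[-1]
        # The two branches of the recursion in example [3.26].
        path.append(2 * x if x < Fraction(1, 2) else 2 * x - 1)
    return path


path = doubling_map_path(Fraction(3, 7), 6)
# 3/7 -> 6/7 -> 5/7 -> 3/7 -> ... : this starting point is periodic.
```

Each step shifts the binary expansion of $\omega$ left by one place and drops the integer part, so the process is closely tied to the binary digits of $\omega$.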

[3.27] Given the same probability space as in example [3.26], define
$X(t, \omega) = \cos(t + 2\pi\omega)$, $t \in \Re$. The resulting random process $\{X(t)\}$
is continuous time and continuous amplitude and is called a random
phase process since all of the possible waveforms are shifts of one
another. Note that the pdf of $X(t, \omega)$ does not depend on time
(problem 3.29).



[3.28] Take any one of the foregoing (real) processes and quantize or clip
it; that is, define a binary quantizer $q$ by

$$q(r) = \begin{cases} a & \text{if } r \ge 0 \\ b & \text{if } r < 0 \end{cases}$$

and define the process $Y(t, \omega) = q(X(t, \omega))$, all $t$. (Typically $b = -a$.)
This is a common form of signal processing, converting a continuous
alphabet random process into a discrete alphabet random process.

This process is discrete alphabet and is either continuous or discrete
time, depending on the original $X$ process. In any case $Y(t)$ has a binary
pmf that, in general, varies with time.

[3.29] Say we have two random variables $U$ and $V$ defined on a common
probability space $(\Omega, \mathcal{F}, P)$. Then

$$X(t) = U \cos(2\pi f_0 t + V)$$

defines a random process on the same probability space for any fixed
parameter $f_0$.

All the foregoing random processes are well defined. The processes in- 
herit probabilistic descriptions from the underlying probability space. The 
techniques of derived distributions can be used to compute probabilities 
involving the outputs since, for example, any problem involving a single 
sample time is simply a derived distribution for a single random variable, 
and any problem involving a finite collection of sample times is a single ran- 
dom vector derived distribution problem. Several examples are explored in 
the problems at the end of the chapter. 

3.15 Directly Given Random Processes 

3.15.1 The Kolmogorov Extension Theorem 

Consistency of distributions of random vectors of various dimensions plays 
a far greater role in the theory and practice of random processes than sim- 
ply a means of checking the correctness of a computation. We have thus far 
argued that a necessary condition for a set of random vector distributions to 
describe collections of samples taken from a random process is that the dis- 
tributions be consistent, e.g., given marginals and joints we must be able to 
compute the marginals from the joints. The Kolmogorov extension theorem 
states that consistency is also sufficient for a family of finite-dimensional 
vector distributions to describe a random process, that is, for there to exist 
a well defined random process that agrees with the given family of finite 
dimensional distributions. We state the theorem without proof as the proof 
is far beyond the assumed mathematical prerequisites for this course. (The 
interested reader is referred to [45, 6, 22].) Happily, however, it is often 
straightforward, if somewhat tedious, to demonstrate that the conditions 
of the theorem hold and hence that a proposed model is well-defined. 

Theorem 3.3 Kolmogorov Extension Theorem 

Suppose that one is given a consistent family of finite dimensional
distributions $P_{X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}}$ for all positive integers $k$ and all possible sample
times $t_i \in T$; $i = 0, 1, \ldots, k-1$. Then there exists a random process
$\{X_t;\ t \in T\}$ that is consistent with this family. In other words, in order to
completely describe a random process, it is sufficient to describe a consistent 
family of finite dimensional distributions of its samples. 

3.15.2 IID Random Processes 

The next example extends the idea of an iid vector to provide one of the 
most important random process models. Although such processes are sim- 
ple in that they possess no memory among samples, they play a fundamental 
role as a building block for more complicated processes as well as being an 
important example in their own right. In a sense these are the most ran- 
dom of all possible random processes because knowledge of the past does 
not help predict future behavior. 

A discrete-time random process $\{X_n\}$ is said to be iid if all
finite-dimensional random vectors formed by sampling the process are iid; that
is, if for any $k$ and any collection of distinct sample times $t_0, t_1, \ldots, t_{k-1}$
the random vector $(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ is iid.

This definition is equivalent to the simpler definition of the Introduction
to this chapter, but the more general form is adopted because it more closely
resembles definitions to be introduced later. IID random processes are often
called Bernoulli processes, especially in the binary case.

It can be shown with cumbersome but straightforward effort that the 
random process of [3.24] is in fact iid. In fact, for any given marginal 
distribution there exists an iid process with that marginal distribution. Al- 
though eminently believable, this fact requires the Kolmogorov extension 
theorem, which states that a consistent family of finite-dimensional distri- 
butions implies the existence of a random process described or specified by 
those distributions. The demonstration of consistency for IID processes is 
straightforward and readers are encouraged to convince themselves for the
case of $n$-dimensional distributions reducing to $(n-1)$-dimensional
distributions.
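The consistency check for the iid case can also be carried out by brute force on a small alphabet. The following sketch is an illustration under assumed values: the marginal pmf and the helper name `iid_pmf` are hypothetical. It sums the $n$-dimensional product pmf over its last coordinate and compares the result with the $(n-1)$-dimensional product pmf.

```python
from fractions import Fraction
from itertools import product


def iid_pmf(x, marginal):
    """Joint pmf of an iid vector: the product of the marginal pmf values."""
    prob = Fraction(1)
    for xi in x:
        prob *= marginal[xi]
    return prob


# Hypothetical marginal pmf on the alphabet {0, 1, 2}.
marginal = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
n = 3

# Summing out the last coordinate of the n-dimensional pmf must give back
# the (n-1)-dimensional pmf for every choice of the remaining coordinates.
consistent = all(
    sum(iid_pmf(x + (last,), marginal) for last in marginal) == iid_pmf(x, marginal)
    for x in product(marginal, repeat=n - 1)
)
```

The check succeeds because the marginal pmf sums to one, which is the whole content of the consistency argument in the iid case.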

3.15.3 Gaussian Random Processes 

A random process is Gaussian if for all positive integers $k$ and all possible
sample times $t_i \in T$; $i = 0, 1, \ldots, k-1$, the random vectors
$(X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}})$ are Gaussian.

In order to describe a Gaussian process and verify the consistency
conditions of the Kolmogorov extension theorem, one has to provide the $\Lambda$
matrices and $m$ vectors for all of the random vectors $(X_{t_0}, \ldots, X_{t_{k-1}})$.
This is accomplished by providing a mean function $m(t)$; $t \in T$ and a
covariance function $\Lambda(t, s)$; $t, s \in T$, which then yield all of the required
mean vectors and covariance matrices by sampling, that is, the mean vector
for $(X_{t_0}, \ldots, X_{t_{k-1}})$ is $(m(t_0), \ldots, m(t_{k-1}))$ and the covariance
matrix is $\Lambda = \{\Lambda(t_l, t_j);\ l, j \in \mathcal{Z}_k\}$.

That this family of density functions is in fact consistent is much
more difficult to verify than was the case for iid processes, but it requires
straightforward brute force in calculus rather than any deep mathematical
ideas to do so.

The Gaussian random process in both discrete and continuous time is 
virtually ubiquitous in the analysis of random systems. This is both because 
the model is good for a wide variety of physical phenomena and because it 
is extremely tractable for analysis. 

3.16 Discrete Time Markov Processes 

An iid process is often referred to as a memoryless process because of the 
independence among the samples. Such a process is both one of the simplest 
random processes and one of the most random. It is simple because the 
joint pmf’s are easily found as products of marginals. It is “most random” 
because knowing the past (or future) outputs does not help improve the 
probabilities describing the current output. It is natural to seek straight- 
forward means of describing more complicated processes with memory and 
to analyze the properties of processes resulting from operations on iid pro- 
cesses. A general approach towards modeling processes with memory is to 
filter memoryless processes, to perform an operation (a form of signal pro- 
cessing) on an input process which produces an output process that is not 
iid. In this section we explore several examples of such a construction, all of 
which provide examples of the use of conditional distributions for describing 
and investigating random processes. All of the processes considered in this 
section will prove to be examples of Markov processes, a class of random 
processes possessing a specific form of dependence among current and past 
samples. 

3.16.1 A Binary Markov Process 

Suppose that $\{X_n;\ n = 0, 1, \ldots\}$ is a Bernoulli process with

$$p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1 - p & x = 0, \end{cases} \tag{3.127}$$

where $p \in (0, 1)$ is a fixed parameter. Since the pmf does not depend on $n$,
the subscript is dropped and the pmf abbreviated to $p_X$. The pmf can also
be written as

$$p_X(x) = p^x (1 - p)^{1-x}; \quad x = 0, 1. \tag{3.128}$$

Since the process is assumed to be iid,

$$p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1 - p)^{n - w(x^n)}, \tag{3.129}$$

where $w(x^n)$ is the number of nonzero $x_i$ in $x^n$, the Hamming weight of the
binary vector $x^n$.

We consider using $\{X_n\}$ as the input to a device which produces an
output binary process $\{Y_n\}$. The device can be viewed as a signal processor
or as a linear filter. Since the process is binary, the most natural “linear”
operations are those in the binary alphabet using modulo 2 arithmetic
as defined in (3.65-3.66). Consider the new random process
$\{Y_n;\ n = 0, 1, 2, \ldots\}$ defined by

$$Y_n = \begin{cases} Y_0 & n = 0 \\ X_n \oplus Y_{n-1} & n = 1, 2, \ldots, \end{cases} \tag{3.130}$$

where $Y_0$ is a binary equiprobable random variable ($p_{Y_0}(0) = p_{Y_0}(1) = 1/2$)
assumed to be independent of all of the $X_n$. This is an example of a linear
(modulo 2) recursion or difference equation. The process can also be defined
for $n = 1, 2, \ldots$ by

$$Y_n = \begin{cases} 1 & \text{if } X_n \ne Y_{n-1} \\ 0 & \text{if } X_n = Y_{n-1}. \end{cases}$$

This process is called a binary autoregressive process.

It should be apparent that $Y_n$ has quite different properties from $X_n$.
In particular, it depends strongly on past values. If $p < 1/2$, $Y_n$ is
more likely to equal $Y_{n-1}$ than it is to differ. If $p$ is small, for example,
$Y_n$ is likely to have long runs of 0’s and 1’s. $\{Y_n\}$ is indeed a random
process because it has been defined as a sequence of random variables on a
common experiment, the outputs of the $\{X_n\}$ process and an independent
selection of $Y_0$. Thus all of its joint pmf’s $p_{Y^n}(y^n) = \Pr(Y^n = y^n)$ should
be derivable from the inverse image formula. We proceed to solve this
derived distribution and then to interpret the result.

Using the inverse image formula in the general sense, which involves
finding a probability of an event involving $Y^n$ in terms of the probability
of an event involving $X^n$ (and, in this case, the initial value $Y_0$), yields the
following sequence of steps:
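$$\begin{aligned}
p_{Y^n}(y^n) &= \Pr(Y^n = y^n) \\
&= \Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, \ldots, Y_{n-1} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus Y_0 = y_1, X_2 \oplus Y_1 = y_2, \ldots, X_{n-1} \oplus Y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 \oplus y_0 = y_1, X_2 \oplus y_1 = y_2, \ldots, X_{n-1} \oplus y_{n-2} = y_{n-1}) \\
&= \Pr(Y_0 = y_0, X_1 = y_1 \oplus y_0, X_2 = y_2 \oplus y_1, \ldots, X_{n-1} = y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0, X_1, X_2, \ldots, X_{n-1}}(y_0, y_1 \oplus y_0, y_2 \oplus y_1, \ldots, y_{n-1} \oplus y_{n-2}) \\
&= p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i \oplus y_{i-1}).
\end{aligned} \tag{3.131}$$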

The derivation used the fact that $a \oplus b = c$ if and only if $a = b \oplus c$ and
the independence of $Y_0, X_1, X_2, \ldots, X_{n-1}$ and the fact that the $X_i$ are
iid. This formula completes the first goal, except possibly plugging in the
specific forms of $p_{Y_0}$ and $p_X$ to get

$$p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n-1} p^{y_i \oplus y_{i-1}} (1 - p)^{1 - y_i \oplus y_{i-1}}. \tag{3.132}$$

The marginal pmf’s for $Y_n$ can be evaluated by summing out the joints,
e.g.,

$$\begin{aligned}
p_{Y_1}(y_1) &= \sum_{y_0} p_{Y_0, Y_1}(y_0, y_1) \\
&= \frac{1}{2} \sum_{y_0} p^{y_1 \oplus y_0} (1 - p)^{1 - y_1 \oplus y_0} \\
&= \frac{1}{2}; \quad y_1 = 0, 1.
\end{aligned}$$

In a similar fashion it can be shown that the marginals for $Y_n$ are all the
same:

$$p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1; \quad n = 0, 1, 2, \ldots, \tag{3.133}$$

and hence as with $X_n$ the pmf can be abbreviated as $p_Y$, dropping the
subscript.

Observe in particular that unlike the iid $\{X_n\}$ process,

$$p_{Y^n}(y^n) \ne \prod_{i=0}^{n-1} p_Y(y_i) \tag{3.134}$$

and hence $\{Y_n\}$ is not an iid process and the joint pmf cannot be written
as a product of the marginals. Nonetheless, the joint pmf can be written as
a product of simple terms, as has been done in (3.132). From the definition
of conditional probability and (3.131)

$$p_{Y_l | Y_0, Y_1, \ldots, Y_{l-1}}(y_l | y_0, y_1, \ldots, y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l \oplus y_{l-1}) \tag{3.135}$$

and (3.131) is then recognizable as the chain rule (3.51) for the joint pmf
$p_{Y^n}(y^n)$.

Note that the conditional probability of the current output $Y_l$ given the
values for the entire past $Y_i$; $i = 0, 1, \ldots, l-1$ depends only on the most
recent past output $Y_{l-1}$! This property can be summarized nicely by also
deriving the conditional pmf

$$p_{Y_l | Y_{l-1}}(y_l | y_{l-1}) = \frac{p_{Y_{l-1}, Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})}, \tag{3.136}$$

which with a little effort resembling the previous derivation can be evaluated
as $p^{y_l \oplus y_{l-1}} (1 - p)^{1 - y_l \oplus y_{l-1}}$. Thus the $\{Y_n\}$ process has the property that

$$p_{Y_l | Y_0, Y_1, \ldots, Y_{l-1}}(y_l | y_0, y_1, \ldots, y_{l-1}) = p_{Y_l | Y_{l-1}}(y_l | y_{l-1}). \tag{3.137}$$

A discrete time random process with this property is called a Markov pro- 
cess or Markov chain. Such processes are among the most studied random 
processes with memory. 
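The product form of the joint pmf and the uniform marginals can be checked by exhaustive enumeration for small $n$. The sketch below is only an illustration: the function names are hypothetical, and exact fractions are used so that the comparison with (3.132) and (3.133) is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product


def joint_pmf_formula(y, p):
    """Joint pmf of (Y_0, ..., Y_{n-1}) from the product form (3.132)."""
    prob = Fraction(1, 2)                      # p_{Y_0}(y_0)
    for prev, cur in zip(y, y[1:]):
        prob *= p if cur ^ prev else 1 - p     # p_X(y_i XOR y_{i-1})
    return prob


def joint_pmf_enumerated(n, p):
    """Same pmf found by enumerating Y_0 and X_1, ..., X_{n-1} directly."""
    pmf = {}
    for y0 in (0, 1):
        for xs in product((0, 1), repeat=n - 1):
            prob = Fraction(1, 2)
            y = [y0]
            for x in xs:
                prob *= p if x else 1 - p
                y.append(x ^ y[-1])            # Y_n = X_n XOR Y_{n-1}
            pmf[tuple(y)] = pmf.get(tuple(y), 0) + prob
    return pmf


p = Fraction(1, 10)
n = 4
pmf = joint_pmf_enumerated(n, p)
# Marginal pmf of the last sample, found by summing out the other samples.
marginal_last = {b: sum(q for y, q in pmf.items() if y[-1] == b) for b in (0, 1)}
```

With a small $p$ such as the one assumed here, the sequences with few transitions (long runs of 0’s or 1’s) carry most of the probability, even though each individual sample is equiprobable.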



3.16.2 The Binomial Counting Process 

We next turn to a filtering of a Bernoulli process that is linear in the
ordinary sense of real numbers. Now the input process will be binary,
but the output process will have the nonnegative integers as an alphabet.
Simply speaking, the output process will be formed by counting the number
of heads in a sequence of coin flips.

Let $\{X_n\}$ be an iid binary random process with marginal pmf
$p_X(1) = p = 1 - p_X(0)$. Define a new one-sided random process
$\{Y_n;\ n = 0, 1, \ldots\}$ by

$$Y_n = \begin{cases} Y_0 = 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n-1} + X_n & n = 1, 2, \ldots \end{cases} \tag{3.138}$$

For $n \ge 1$ this process can be viewed as the output of a discrete time
time-invariant linear filter with Kronecker delta response $h_k$ given by $h_k = 1$
for $k \ge 0$ and $h_k = 0$ otherwise. From (3.138), each random variable $Y_n$
provides a count of the number of 1’s appearing in the $X_n$ process through
time $n$. Because of this counting structure we have that either

$$Y_n = Y_{n-1} \quad \text{or} \quad Y_n = Y_{n-1} + 1; \quad n = 2, 3, \ldots. \tag{3.139}$$

In general, a discrete time process that satisfies (3.139) is called a counting 
process since it is nondecreasing, and when it jumps, it is always with an 
increment of 1. (A continuous alphabet counting process is similarly defined 
as a process with a nondecreasing output which increases in steps of 1.) 

To completely describe this process it suffices to have a formula for the
joint pmf’s

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l | Y_1, \ldots, Y_{l-1}}(y_l | y_1, \ldots, y_{l-1}), \tag{3.140}$$

since arbitrary joint distributions can be found from such joint
distributions of contiguous samples by summing out the unwanted dummy
variables. When we have constructed one process $\{Y_n\}$ from an existing process
$\{X_n\}$, we need not worry about consistency since we have defined the new
process on an underlying probability space (the output space of the original 
process), and hence the joint distributions must be consistent if they are 
correctly computed from the underlying probability measure — the process 
distribution for the iid process. 

Since $Y_n$ is formed by summing $n$ Bernoulli random variables, the pmf
for $Y_n$ follows immediately from (3.113): it is the binomial pmf, and hence
the process is referred to as the binomial counting process.

The joint probabilities could be computed using the vector inverse image 
formula as with the binary Markov source, but instead we focus on the 
conditional distributions and compute them directly. The same approach 
could have been used for the binary Markov example. 

To compute the conditional pmf’s involves describing probabilistically 
the next output of the process if we are given the previous n—1 outputs 
Yi, . . . , Yn-i. For the binomial counting process, the next output is formed 
simply by adding a binary random variable to the old sum. Thus all of the 
conditional probability mass is concentrated on two values — the last value 
and the last value plus 1. The conditional pmf’s can therefore be expressed 
as 

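$$\begin{aligned}
p_{Y_n | Y_{n-1}, \ldots, Y_1}&(y_n | y_{n-1}, \ldots, y_1) \\
&= \Pr(Y_n = y_n | Y_l = y_l;\ l = 1, \ldots, n-1) \\
&= \Pr(X_n = y_n - y_{n-1} | Y_l = y_l;\ l = 1, \ldots, n-1) \\
&= \Pr(X_n = y_n - y_{n-1} | X_1 = y_1, X_i = y_i - y_{i-1};\ i = 2, 3, \ldots, n-1),
\end{aligned} \tag{3.141}$$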
since from the definition of the $Y_n$ process the conditioning event
$\{Y_i = y_i;\ i = 1, 2, \ldots, n-1\}$ is identical to the event
$\{X_1 = y_1, X_i = y_i - y_{i-1};\ i = 2, 3, \ldots, n-1\}$ and, given this event, the
event $Y_n = y_n$ is identical to the event $X_n = y_n - y_{n-1}$. In words, the $Y_n$
will assume the given values if and only if the $X_n$ assume the corresponding
differences since the $Y_n$ are defined as the sum of the $X_n$. Now, however, the
probability is entirely in terms of the given $X_i$ variables, in particular,

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_{X_n | X_{n-1}, \ldots, X_2, X_1}(y_n - y_{n-1} | y_{n-1} - y_{n-2}, \ldots, y_2 - y_1, y_1). \tag{3.142}$$

So far the development is valid for any process and has not used the fact
that the $\{X_n\}$ are iid. If the $\{X_n\}$ are iid, then the conditional pmf’s are
simply the marginal pmf’s since each $X_n$ is independent of past $X_k$; $k < n$.
Thus we have that

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_X(y_n - y_{n-1}) \tag{3.143}$$

and hence from the chain rule the vector pmf is (defining $y_0 = 0$)

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p_X(y_i - y_{i-1}), \tag{3.144}$$

providing the desired specification.

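To apply this formula to the special case of the binomial counting
process, we need only plug in the binary pmf for $p_X$ to obtain the desired
specification of the binomial counting process:

$$p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p^{(y_i - y_{i-1})} (1 - p)^{1 - (y_i - y_{i-1})},$$

where

$$y_i - y_{i-1} = 0 \text{ or } 1, \quad i = 1, 2, \ldots, n; \quad y_0 = 0. \tag{3.145}$$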

A similar derivation could be used to evaluate the conditional pmf for
$Y_n$ given only its immediate predecessor as:

$$\begin{aligned}
p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) &= \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) \\
&= \Pr(X_n = y_n - y_{n-1} | Y_{n-1} = y_{n-1}).
\end{aligned}$$

The conditioning event, however, depends only on values of $X_k$ for $k < n$,
and $X_n$ is independent of its past; hence

$$p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = p_X(y_n - y_{n-1}). \tag{3.146}$$
The same conclusion can be reached by the longer route of using the joint
pmf for $Y_1, \ldots, Y_n$ previously computed to find the joint pmf for $Y_n$ and
$Y_{n-1}$, which in turn can be used to find the conditional pmf. Comparison
with (3.143) reveals that processes formed by summing iid processes (such
as the binomial counting process) have the property that

$$p_{Y_n | Y_{n-1}, \ldots, Y_1}(y_n | y_{n-1}, \ldots, y_1) = p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) \tag{3.147}$$

or, equivalently,

$$\Pr(Y_n = y_n | Y_i = y_i;\ i = 1, \ldots, n-1) = \Pr(Y_n = y_n | Y_{n-1} = y_{n-1}), \tag{3.148}$$

that is, they are Markov processes. Roughly speaking, given the most recent 
past sample (or the current sample), the remainder of the past does not 
affect the probability of what happens next. Alternatively stated, given the 
present, the future is independent of the past. 
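The joint pmf (3.145) can be checked against the binomial marginal by summing out the earlier samples. The sketch below is illustrative only: the function name `counting_joint_pmf` and the parameter values are assumptions, and the brute-force sum over all vectors is practical only for small $n$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def counting_joint_pmf(y, p):
    """Joint pmf of (Y_1, ..., Y_n) from (3.145), with y_0 = 0."""
    prob = Fraction(1)
    prev = 0
    for cur in y:
        step = cur - prev
        if step not in (0, 1):
            return Fraction(0)       # a counting process moves by 0 or +1
        prob *= p if step else 1 - p
        prev = cur
    return prob


p = Fraction(1, 3)
n = 4
# Sum the joint pmf over y_1, ..., y_{n-1} to get the marginal pmf of Y_n.
marginal = [Fraction(0)] * (n + 1)
for y in product(range(n + 1), repeat=n):
    marginal[y[-1]] += counting_joint_pmf(y, p)
binomial = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
```

Each factor of the product depends only on the current and the immediately preceding value, which is the Markov property (3.147) in computational form.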



3.16.3 ⋆Discrete Random Walk



As a second example of the preceding development, consider the random
walk defined as in (3.138), i.e., by

$$Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, \ldots, \end{cases} \tag{3.149}$$

where the iid process used has alphabet $\{1, -1\}$ and $\Pr(X_n = -1) = p$.
This is another example of an autoregressive process since it can be written
in the form of a regression

$$Y_n = Y_{n-1} + X_n, \quad n = 1, 2, \ldots \tag{3.150}$$



One can think of $Y_n$ as modeling a drunk on a path who flips a coin at each
minute to decide whether to take one step forward or one step backward.
In this case the transform of the iid random variables is

$$M_X(ju) = (1 - p) e^{ju} + p e^{-ju},$$

and hence using the binomial theorem of algebra we have that

$$\begin{aligned}
M_{Y_n}(ju) &= \left( (1 - p) e^{ju} + p e^{-ju} \right)^n \\
&= \sum_{k=0}^{n} \binom{n}{k} (1 - p)^{n-k} p^k e^{ju(n - 2k)} \\
&= \sum_{k = -n, -n+2, \ldots, n-2, n} \binom{n}{(n-k)/2} (1 - p)^{(n+k)/2} p^{(n-k)/2} e^{juk}.
\end{aligned}$$

Comparison of this formula with the definition of the characteristic
function reveals that the pmf for $Y_n$ is given by

$$p_{Y_n}(k) = \binom{n}{(n-k)/2} (1 - p)^{(n+k)/2} p^{(n-k)/2}, \quad k = -n, -n+2, \ldots, n-2, n.$$

Note that $Y_n$ must be even or odd depending on whether $n$ is even or odd.
This follows from the nature of the increments.
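The closed form can be compared with the pmf built one step at a time from the regression (3.150). This is a sketch under assumed parameter values; the function names are hypothetical, and $p$ denotes $\Pr(X_n = -1)$ as in the text.

```python
from fractions import Fraction
from math import comb


def random_walk_pmf(n, p):
    """Closed-form pmf of Y_n; keys are k = -n, -n+2, ..., n."""
    return {
        k: comb(n, (n - k) // 2) * (1 - p) ** ((n + k) // 2) * p ** ((n - k) // 2)
        for k in range(-n, n + 1, 2)
    }


def random_walk_pmf_by_steps(n, p):
    """Same pmf built recursively from Y_n = Y_{n-1} + X_n with Y_0 = 0."""
    pmf = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for y, q in pmf.items():
            nxt[y + 1] = nxt.get(y + 1, 0) + q * (1 - p)   # step forward
            nxt[y - 1] = nxt.get(y - 1, 0) + q * p         # step backward
        pmf = nxt
    return pmf


p = Fraction(1, 4)
closed_form = random_walk_pmf(5, p)
stepwise = random_walk_pmf_by_steps(5, p)
```

For $n = 5$ only the odd values $-5, -3, \ldots, 5$ appear as keys, in line with the parity remark above.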

3.16.4 The Discrete Time Wiener Process 

Again consider a process formed by summing an iid process as in (3.138).
This time, however, let $\{X_n\}$ be an iid process with zero-mean Gaussian
marginal pdf’s and variance $\sigma^2$. Then the process $\{Y_n\}$ defined by (3.138)
is called the discrete time Wiener process. The discrete time continuous
alphabet case of summing iid random variables is handled in virtually the
same manner as the discrete alphabet case, with conditional pdf’s replacing
conditional pmf’s.

The marginal pdf for $Y_n$ is given immediately by (3.124) as $\mathcal{N}(0, n\sigma^2)$.
To find the joint pdf’s we evaluate the pdf chain rule of (3.63):

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = f_{Y_1}(y_1) \prod_{l=2}^{n} f_{Y_l | Y_1, \ldots, Y_{l-1}}(y_l | y_1, \ldots, y_{l-1}). \tag{3.151}$$

To find the conditional pdf $f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1})$ we compute the
conditional cdf $P(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1)$. Analogous to
the discrete case, we have from the representation of (3.138) and the fact
that the $X_n$ are iid that

$$\begin{aligned}
P(Y_n \le y_n &| Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1) \\
&= P(X_n \le y_n - y_{n-1} | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots, n-1) \\
&= P(X_n \le y_n - y_{n-1}) \\
&= F_X(y_n - y_{n-1}),
\end{aligned} \tag{3.152}$$

and hence differentiating the conditional cdf to obtain the conditional pdf
yields

$$f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1}) = \frac{\partial}{\partial y_n} F_X(y_n - y_{n-1}) = f_X(y_n - y_{n-1}), \tag{3.153}$$
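the continuous analog of (3.143). Application of the pdf chain rule then
yields the continuous analog to (3.144) (again with $y_0 = 0$):

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{i=1}^{n} f_X(y_i - y_{i-1}). \tag{3.154}$$

Finally suppose that $f_X$ is Gaussian with zero mean and variance $\sigma^2$. Then
this becomes

$$\begin{aligned}
f_{Y^n}(y^n) &= \frac{e^{-y_1^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{e^{-(y_i - y_{i-1})^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \left( \sum_{i=2}^{n} (y_i - y_{i-1})^2 + y_1^2 \right) \right).
\end{aligned} \tag{3.155}$$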



This proves to be a Gaussian pdf with mean vector 0 and a covariance
matrix with entries $K_Y(m, n) = \sigma^2 \min(m, n)$, $m, n = 1, 2, \ldots$. (Readers
are invited to test their matrix manipulation skills and verify this claim.)
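One way to carry out the verification numerically is sketched below; it is an illustration under assumed values, not a proof. Writing the sums as $Y = AX$ with $A$ lower triangular of ones gives covariance $\sigma^2 A A^t$, and the exponent of (3.155) is $-\frac{1}{2} y^t Q y$ with $Q = D^t D / \sigma^2$, where $D$ maps $y$ to its increments. The names `A`, `D`, `Q`, `matmul` and `transpose` are hypothetical helpers.

```python
from fractions import Fraction


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


n = 5
sigma2 = Fraction(3, 2)    # arbitrary illustrative variance

# Y = A X with A lower triangular of ones, since Y_m = X_1 + ... + X_m.
A = [[1 if k <= m else 0 for k in range(n)] for m in range(n)]
# Covariance of Y when the X_k are iid with variance sigma2.
K = [[sigma2 * v for v in row] for row in matmul(A, transpose(A))]

# D maps y to the increments (y_1, y_2 - y_1, ..., y_n - y_{n-1}).
D = [[1 if k == m else -1 if k == m - 1 else 0 for k in range(n)] for m in range(n)]
# Matrix of the quadratic form in the exponent of (3.155).
Q = [[v / sigma2 for v in row] for row in matmul(transpose(D), D)]

identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
is_inverse = matmul(K, Q) == identity
```

If `is_inverse` holds, the quadratic form in (3.155) is $y^t K^{-1} y$ for the matrix $K$ with entries $\sigma^2 \min(m, n)$, which is what the Gaussian pdf with that covariance requires.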

As in the discrete alphabet case, a similar argument implies that

$$f_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = f_X(y_n - y_{n-1})$$

and hence from (3.153) that

$$f_{Y_n | Y_1, \ldots, Y_{n-1}}(y_n | y_1, \ldots, y_{n-1}) = f_{Y_n | Y_{n-1}}(y_n | y_{n-1}). \tag{3.156}$$

As in the discrete alphabet case, a process with this property is called a
Markov process. We can combine the discrete alphabet and continuous
alphabet definitions into a common definition: A discrete time random
process $\{Y_n\}$ is said to be a Markov process if the conditional cdf’s satisfy
the relation

$$\Pr(Y_n \le y_n | Y_{n-i} = y_{n-i};\ i = 1, 2, \ldots) = \Pr(Y_n \le y_n | Y_{n-1} = y_{n-1}) \tag{3.157}$$

for all $y_{n-1}, y_{n-2}, \ldots$. More specifically, $\{Y_n\}$ is frequently called a
first-order Markov process because it depends on only the most recent past
value. An extended definition to $n$th order Markov processes can be made
in the obvious fashion.



3.16.5 Hidden Markov Models 

A popular random process model that has proved extremely important 
in the development of modern speech recognition is formed by adding an 
iid process to a Markov process, so that the underlying Markov process 
is “hidden.” Suppose for example that $\{X_n\}$ is a Markov process with
either discrete or continuous alphabet and that $\{W_n\}$ is an iid process,
for example an iid Gaussian process. Then the resulting process
$Y_n = X_n + W_n$ is an example of a hidden Markov model or, in the language
of early information theory, a Markov source. A wide literature exists
for estimating the parameters of the underlying Markov source when only
the sum process $Y_n$ is actually observed. A hidden Markov model can
be equivalently considered as viewing a Markov process through a noisy
channel with iid Gaussian noise.



3.17 ⋆Nonelementary Conditional Probability



Perhaps the most important form for conditional probabilities is the basic
form of $\Pr(Y \in F | X = x)$, a probability measure on a random variable $Y$
given the event that another random variable $X$ takes on a specific value $x$.
We consider a general event $Y \in F$ and not simply $Y = y$ since the latter
is usually useless in the continuous case. In general, either or both $Y$ or $X$
might be random vectors.

In the elementary discrete case, such conditional probabilities are easily 
constructed in terms of conditional pmf’s using (3.47): conditional prob- 
ability is found by summing conditional probability mass over the event, 
just as is done in the unconditional case. We have proposed an analogous 
approach to continuous probability, but this does not lead to a useful gen- 
eral theory. For example, it assumes that the various pdf’s all exist and are 
well behaved. As a first step towards a better general definition (which will 
reduce in practice to the constructive pdf definition when it makes sense), 
we derive a variation of (3.47). Multiply both sides of (3.47) by $p_X(x)$ and
sum over an $X$-event $G$ to obtain



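$$\begin{aligned}
\sum_{x \in G} P(Y \in F \mid X = x)\, p_X(x) &= \sum_{x \in G} \sum_{y \in F} p_{Y|X}(y|x)\, p_X(x) \\
&= \sum_{x \in G} \sum_{y \in F} p_{X,Y}(x, y) \\
&= P(X \in G, Y \in F) \\
&= P_{X,Y}(G \times F); \quad \text{all events } G.
\end{aligned} \tag{3.158}$$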



This formula in a sense describes the essence of the conditional probability
by saying what it does: For any $X$-event $G$, summing the product of the
conditional probability that $Y \in F$ and the marginal probability that $X = x$
over all $x \in G$ yields the joint probability that $X \in G$ and $Y \in F$. If our
tentative definition of nonelementary conditional probability is to be useful,
it must play a similar role in the continuous case, that is, we should be able
to average over conditional probabilities to find ordinary joint probabilities,
where now averages are integrals instead of sums. This indeed works since



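$$\begin{aligned}
\int_{x \in G} dx\, P(Y \in F \mid X = x)\, f_X(x) &= \int_{x \in G} dx \int_{y \in F} dy\, f_{Y|X}(y|x)\, f_X(x) \\
&= \int_{x \in G} dx \int_{y \in F} dy\, f_{X,Y}(x, y) \\
&= P(X \in G, Y \in F) \\
&= P_{X,Y}(G \times F); \quad \text{all events } G.
\end{aligned} \tag{3.159}$$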



Thus the tentative definition of nonelementary conditional probability of 
(3.53) behaves in the manner that one would like. Using the Stieltjes no- 
tation we can combine (3.158) and (3.159) into a single requirement: 



$$\int_G P(Y \in F \mid X = x)\, dF_X(x) = P(X \in G, Y \in F) = P_{X,Y}(G \times F); \quad \text{all events } G, \tag{3.160}$$



which is valid in both the discrete case and in the continuous case when
one has a conditional pdf. In advanced probability, (3.160) is taken as the
definition for the general (nonelementary) conditional probability
$P(Y \in F \mid X = x)$; that is, the conditional probability is defined as any function
of $x$ that satisfies (3.160). This is a descriptive definition which defines an
object by its behavior when integrated, much as the rigorous definition of
a Dirac delta function is given by its behavior inside an integral. This reduces to
the given constructive definitions of (3.47) in the discrete case and (3.53)
in the continuous case with a well behaved pdf. It also leads to a useful
general theory even when the conditional pdf is not well defined.

Lastly, we observe that elementary and nonelementary conditional probabilities
are related in the natural way. Suppose that $G$ is an event with
nonzero probability so that the elementary conditional probability
$P(Y \in F \mid X \in G)$ is well defined. Then



$$P(Y \in F \mid X \in G) = \frac{P_{X,Y}(G \times F)}{P_X(G)} = \frac{1}{P_X(G)} \int_G P(Y \in F \mid X = x)\, dF_X(x). \tag{3.161}$$
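
The averaging property (3.158) and the relation (3.161) can be checked numerically in the discrete case. The sketch below is only an illustration: the joint pmf values, the events `G` and `F`, and all function names are hypothetical and chosen for convenience, and exact rational arithmetic is used so that the two sides can be compared without rounding error.

```python
from fractions import Fraction

# Hypothetical joint pmf p_{X,Y}(x, y) on {0,1,2} x {0,1}; values are illustrative.
joint = {
    (0, 0): Fraction(1, 12), (0, 1): Fraction(2, 12),
    (1, 0): Fraction(3, 12), (1, 1): Fraction(1, 12),
    (2, 0): Fraction(1, 12), (2, 1): Fraction(4, 12),
}

def marginal_x(x):
    """p_X(x), found by summing the joint pmf over y."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def cond_prob_y_in_f(F, x):
    """P(Y in F | X = x), by summing the conditional pmf over F."""
    return sum(p for (xx, y), p in joint.items() if xx == x and y in F) / marginal_x(x)

def averaged(G, F):
    """Left side of (3.158): sum over x in G of P(Y in F | X = x) p_X(x)."""
    return sum(cond_prob_y_in_f(F, x) * marginal_x(x) for x in G)

def joint_prob(G, F):
    """Right side of (3.158): P(X in G, Y in F)."""
    return sum(p for (x, y), p in joint.items() if x in G and y in F)

G, F = {0, 2}, {1}
lhs, rhs = averaged(G, F), joint_prob(G, F)
# Elementary conditional probability P(Y in F | X in G), as in (3.161).
elementary = rhs / sum(marginal_x(x) for x in G)
```

For this particular pmf both `lhs` and `rhs` equal 1/2, and dividing by $P_X(G) = 2/3$ gives the elementary conditional probability 3/4.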
3.18 Problems

1. Given the probability space $(\Re, \mathcal{B}(\Re), m)$, where $m$ is the probability
measure induced by the uniform pdf $f$ on $[0, 1]$ (that is, $f(r) = 1$ for
$r \in [0, 1]$ and is 0 otherwise), find the pdf’s for the following random
variables defined on this space:

(a) $X(r) = |r|^2$,

(b) y(r) = , 

(c) Z(r) =ln|r| , 

(d) V(r) = ar + b , where a and b are fixed constants. 

(e) Find the pmf for the random variable $W(r) = 3$ if $r > 2$ and
$W(r) = 1$ otherwise.

2. Do problem 3.1 for an exponential pdf on the original sample space. 

3. Do problem 3.1(a)-(d) for a Gaussian pdf on the original sample space. 

4. A random variable $X$ has a uniform pdf on $[0, 1]$. What is the probability
density function for the volume of a cube with sides of length $X$?

5. A random variable $X$ has a cumulative distribution function $F_X(\alpha)$.
What is the cdf of the random variable Y = aX + b, where a and b 
are constants? 

6. Use the properties of probability measures to prove the following facts 
about cdf’s: If F is the cdf of a random variable, then 

(a) $F(-\infty) = 0$ and $F(\infty) = 1$.

(b) $F(r)$ is a monotonically nondecreasing function, that is, if $x \ge y$,
then $F(x) \ge F(y)$.

(c) $F$ is continuous from the right, that is, if $\epsilon_n$, $n = 1, 2, \ldots$ is a
sequence of positive numbers decreasing to zero, then

$$\lim_{n \to \infty} F(r + \epsilon_n) = F(r).$$

Note that continuity from the right is a result of the fact that we
defined a cdf as the probability of an event of the form $(-\infty, r]$.
If instead we had defined it as the probability of an event of the
form $(-\infty, r)$ (as is often done in Eastern Europe), then cdf’s
would be continuous from the left instead of from the right.
When is a cdf continuous from the left? When is it discontinuous?
7. Say we are given an arbitrary cdf $F$ for a random variable and we
would like to simulate an experiment by generating one of these random
variables as input to the experiment. As is typical of computer
simulations, all we have available is a uniformly distributed random
variable $U$; that is, $U$ has the pdf of problem 3.1. This problem explores a
means of generating the desired random variable from $U$ (this method
is occasionally used in computer simulations). Given the cdf $F$, define
the inverse cdf $F^{-1}(r)$ as the smallest value of $x \in \Re$ for which
$F(x) \ge r$. We specify “smallest” to ensure a unique definition since
$F$ may have the same value for an interval of $x$. Find the cdf of the
random variable $Y$ defined by $Y = F^{-1}(U)$.

This problem shows how to generate a random variable with an arbitrary
distribution from a uniformly distributed random variable using
an inverse cdf. Suppose next that $X$ is a random variable with cdf
$F_X(\alpha)$. What is the distribution of the random variable $Y = F_X(X)$?
This mapping is used on individual picture elements (pixels) in an
image enhancement technique known as “histogram equalization” to
enhance contrast.

8. You are given a random variable U described by a pdf that is 1 on 
[0, 1]. Describe and make a labeled sketch of a function g such that 
the random variable $Y = g(U)$ has a pdf $\lambda e^{-\lambda x}$; $x \ge 0$.

9. A probability space $(\Omega, \mathcal{F}, P)$ models the outcome of rolling two fair
four-sided dice on a glass table and reading their down faces. Hence
we can take $\Omega = \{1, 2, 3, 4\}^2$, the usual event space (the power set
or, equivalently, the Borel field), and a pmf placing equal probability
on all 16 points in the space. On this space we define the following
random variables: $W(\omega)$ = the down face on die #1; that is, if
$\omega = (\omega_1, \omega_2)$, where $\omega_i$ denotes the down face on die #$i$, then
$W(\omega) = \omega_1$. (We could use the sampling function notation here: $W = \Pi_1$.)
Similarly, define $V(\omega) = \omega_2$, the down face on the second die. Define
also $X(\omega) = \omega_1 + \omega_2$, the sum of the down faces, and $Y(\omega) = \omega_1 \omega_2$,
the product of the down faces. Find the pmf and cdf for the random
variables $X$, $Y$, $W$, and $V$. Find the pmf’s for the random vectors
$(X, Y)$ and $(W, V)$. Write a formula for the distribution of the random
vector $(W, V)$ in terms of its pmf.

Suppose that a greedy scientist has rigged the dice using magnets to 
ensure that the two dice always yield the same value; that is, we now 
have a new pmf on $\Omega$ that assigns equal values to all points where
the faces are the same and zero to the remaining points. Repeat the 
calculations for this case. 
10. Consider the two-dimensional probability space $(\Re^2, \mathcal{B}(\Re)^2, P)$, where
$P$ is the probability measure induced by the pdf $g$, which is equal to
a constant $c$ in the square $\{(x, y) : x \in [-1/2, 1/2],\ y \in [-1/2, 1/2]\}$
and zero elsewhere.

(a) Find the constant $c$.

(b) Find $P(\{x, y : x < y\})$.

(c) Define the random variable $U : \Re^2 \to \Re$ by $U(x, y) = x + y$.
Find an expression for the cdf $F_U(u) = \Pr(U \le u)$.

(d) Define the random variable $V : \Re^2 \to \Re$ by $V(x, y) = xy$. Find
the cdf $F_V(v)$.

(e) Define the random variable $W : \Re^2 \to \Re$ by $W(x, y) = \max(x, y)$,
that is, the larger of the two coordinate values. Thus $\max(x, y) = x$
if $x \ge y$. Find the cdf $F_W(w)$.

11. Suppose that X and Y are two random variables described by a pdf 



(a) Find C. 

(b) Find the marginal pdf’s $f_X$ and $f_Y$. Are $X$ and $Y$ independent?
Are they identically distributed?

(c) Define the random variable $Z = X - 2Y$. Find the joint pdf
$f_{X,Z}$.

12. Let $(X, Y)$ be a random vector with distribution $P_{X,Y}$ induced by the
pdf $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, where

$$f_X(x) = f_Y(x) = \lambda e^{-\lambda x}; \quad x \ge 0,$$

that is, $(X, Y)$ is described by a product pdf with exponential components.

(a) Find the pdf for the random variable $U = X + Y$.

(b) Let the “max” function be defined as in problem 3.10 and define
the “min” function as the smaller of two values; that is,
$\min(x, y) = x$ if $x \le y$. Define the random vector $(W, V)$ by
$W = \min(X, Y)$ and $V = \max(X, Y)$. Find the pdf for the
random vector $(W, V)$.

13. Let $(X, Y)$ be a random vector with distribution $P_{X,Y}$ induced by
a product pdf $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ with $f_X$ and $f_Y$ both equal
to the Gaussian pdf with $m = 0$. Consider the random vector as
representing the real and imaginary parts of a complex-valued measurement.
It is often useful to consider instead a magnitude-phase
representation vector $(R, \theta)$, where $R$ is the magnitude $(X^2 + Y^2)^{1/2}$
and $\theta = \tan^{-1}(Y/X)$ (use the principal value of the inverse tangent).
Find the joint pdf of the random vector $(R, \theta)$. Find the marginal
pdf’s of the random variables $R$ and $\theta$. The pdf of $R$ is called the
Rayleigh pdf. Are $R$ and $\theta$ independent?

14. A probability space $(\Omega, \mathcal{F}, P)$ is defined as follows: $\Omega$ consists of all
8-dimensional binary vectors, e.g., every member of $\Omega$ has the form
$\omega = (\omega_0, \ldots, \omega_7)$, where $\omega_i$ is 0 or 1. $\mathcal{F}$ is the power set, $P$ is
described by a pmf which assigns a probability of $1/2^8$ to each of the
$2^8$ elements in $\Omega$ (a uniform pmf).

Find the pmf’s describing the following random variables:

(a) $g(\omega) = \sum_{i} \omega_i$ = number of 1’s in the binary vector.

(b) $X(\omega) = 1$ if there are an even number of 1’s in $\omega$ and 0 otherwise.

(c) $Y(\omega) = \omega_j$, i.e., the value of the $j$th coordinate of $\omega$.

(d) $Z(\omega) = \max_i(\omega_i)$.

(e) $V(\omega) = g(\omega) X(\omega)$, where $g$ and $X$ are as above.

15. Suppose that $(X_0, X_1, \ldots, X_N)$ is a random vector with a product
probability density function with marginal pdf’s

$$f_{X_n}(\alpha) = \begin{cases} 1 & 0 < \alpha < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(The components are iid.) Define the following random variables:



• $U = X_0^2$

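• $V = \max(X_1, X_2, X_3, X_4)$

• $W = \begin{cases} 1 & \text{if } X_1 > 2X_2 \\ 0 & \text{otherwise} \end{cases}$

• A random vector $\mathbf{Y} = (Y_1, \ldots, Y_N)$ is defined by

$$Y_n = X_n + X_{n-1}; \quad n = 1, \ldots, N.$$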



(a) Find the pdf or pmf as appropriate for U, V, and W. 
(b) Find the cumulative distribution function (cdf) for $Y_n$.

16. Let $f$ be the uniform pdf on $[0, 1]$, as in problem 3.1. Let $(X, Y)$ be a random
vector described by a joint pdf

$$f_{X,Y}(x, y) = f(y) f(x - y); \quad \text{all } x, y.$$

(a) Find the marginal densities $f_X$ and $f_Y$. Are $X$ and $Y$ independent?

(b) Find $P(X > 1/2 \mid Y < 1/2)$.

17. In example [3.24] of the binary random process formed by taking the 
binary expansion of a uniformly distributed number on [0, 1], find the 
pmf for the random variable $X_n$ for a fixed $n$. Find the pmf for the
random vector $(X_n, X_k)$ for fixed $n$ and $k$. Consider both the cases
where $n = k$ and where $n \ne k$. Find the probability $\Pr(X_5 = X_{12})$.

18. Let X and Y be two random variables with joint pmf 

$$p_{X,Y}(k, j) = C\,\frac{k}{j+1}; \quad j = 1, \ldots, N;\ k = 1, 2, \ldots, j.$$

(a) Find C. 

(b) Find $p_Y(j)$.

(c) Find $p_{X|Y}(k|j)$. Are $X$ and $Y$ independent?

19. In example [3.27] of the random phase process, find Pr(X(t) > 1/2). 

20. Evaluate the pmf $p_{Y(t)}(y)$ for the quantized process of example [3.28]
for each possible case. (Choose $b = 0$ if the process is nonnegative
and $b = -a$ otherwise.)

21. Let $([0, 1], \mathcal{B}([0, 1]), P)$ be a probability space with pdf $f(\omega) = 1$;
$\omega \in [0, 1]$. Find a random vector $\{X_t;\ t \in \{1, 2, \ldots, n\}\}$ such that
$\Pr(X_t = 1) = \Pr(X_t = 0) = 1/2$ and $\Pr(X_t = 1 \text{ and } X_{t-1} = 1) = 1/8$, for
relevant $t$.

22. Give an example of two equivalent random variables (that is, two 
random variables having the same distribution) that 

(a) are defined on the same space but are not equal for any $\omega \in \Omega$,

(b) are defined on different spaces and have different functional forms. 

23. Let $(\Re, \mathcal{B}(\Re), m)$ be the probability space of example 3.1.

(a) Define the random process $\{X(t);\ t \in [0, \infty)\}$ by

$$X(t, \omega) = \begin{cases} 1 & \text{if } 0 < t < \omega \\ 0 & \text{otherwise.} \end{cases}$$

Find $\Pr(X(t) = 1)$ as a function of $t$.

(b) Define the random process $\{X(t);\ t \in [0, \infty)\}$ by

$$X(t, \omega) = \begin{cases} t/\omega & \text{if } 0 < t < \omega \\ 0 & \text{otherwise.} \end{cases}$$

Find $\Pr(X(t) > x)$ as a function of $t$ for $x \in (0, 1)$.

24. Two continuous random variables X and Y are described by the pdf 



$$f_{X,Y}(x, y) = \begin{cases} c & \text{if } |x| + |y| < r \\ 0 & \text{otherwise,} \end{cases}$$

where $r$ is a fixed real constant and $c$ is a constant. In other words,
the pdf is uniform on a square whose side has length $\sqrt{2}\,r$.

(a) Evaluate c in terms of r. 

(b) Find fx{x). 

(c) Are X and Y independent random variables? (Prove your an- 
swer.) 

(d) Define the random variable $Z = |X| + |Y|$. Find the pdf $f_Z(z)$.

25. Find the pdf of $X(t)$ in example [3.23] as a function of time. Find
the joint cdf of the vector $(X(1), X(2))$.

26. Richard III wishes to trade his kingdom for a horse. He knows that 
the probability that there are k horses within r feet of him is 

$$C\,\frac{H^k r^{2k} e^{-Hr^2}}{k!}; \quad k = 0, 1, 2, \ldots,$$

where $H > 0$ is a fixed parameter.



(a) Let $R$ denote a random variable giving the distance from Richard
to the nearest horse. What is the probability density function
$f_R(\alpha)$ for $R$? ($C$ should be evaluated as part of this question.)

(b) Rumors of the imminent arrival of Henry Tudor have led Richard
to lower his standards and consider alternative means of transportation.
Suppose that the probability density function $f_S(\beta)$
for the distance $S$ to the nearest mule is the same as $f_R$ except
that the parameter $H$ is replaced by a parameter $M$. Assume
that $R$ and $S$ are independent random variables. Find an expression
for the cumulative distribution function (cdf) for $W$,
the distance to the nearest quadruped (i.e., horse or mule).
Hint: If you did not complete or do not trust your answer to
part (a), then find the answer in terms of the cdf’s for $R$ and $S$.



27. Suppose that a random vector $\mathbf{X} = (X_0, \ldots, X_{k-1})$ is iid with marginal
pmf

$$p_{X_i}(l) = p_X(l) = \begin{cases} p & \text{if } l = 1 \\ 1 - p & \text{if } l = 0 \end{cases}$$

for all $i$.



(a) Find the pmf of the random variable Y = 

(b) Find the pmf of the random variable $W = X_0 + X_{k-1}$.

(c) Find the pmf of the random vector $(Y, W)$.



28. Find the joint cdf of the complex components of $X_n(\omega)$ in example
[3.25] as a function of time. — 1/2 < x < 1/2, —1/2 < y < 1/2} 

29. Find the pdf of X(t) in example [3.27]. 

30. A certain communication system outputs a discrete time series $\{X_n\}$
where $X_n$ has pmf $p_X(1) = p_X(-1) = 1/2$. Transmission noise in
the form of a random process $\{Y_n\}$ is added to $X_n$ to form a random
process $\{Z_n = X_n + Y_n\}$. $Y_n$ has a Gaussian distribution with $m = 0$
and $\sigma = 1$.

(a) Find the pdf of $Z_n$.

(b) A receiver forms a random process $\{R_n = \operatorname{sgn}(Z_n)\}$ where sgn is
the sign function $\operatorname{sgn}(x) = 1$ if $x > 0$, $\operatorname{sgn}(x) = -1$ if $x < 0$.
$R_n$ is output from the receiver as the receiver’s estimate of what
was transmitted. Find the pmf of $R_n$ and the probability of
detection (i.e., $\Pr(R_n = X_n)$).

(c) Is this detector optimal? 

31. If $X$ is a Gaussian random variable, find the marginal pdf $f_{Y(t)}$
for the random process $Y(t)$ defined by

$$Y(t) = X \cos(2\pi f_0 t); \quad t \in \Re,$$

where $f_0$ is a known constant frequency.
32. Let $X$ and $Z$ be the random variables of problems 3.1 through 3.3. For
each assumption on the original density find the cdf for the random
vector $(X, Z)$, $F_{X,Z}(x, z)$. Does the appropriate derivative exist? Is
it a valid pdf?

33. Let $N$ be a random variable giving the number of molecules of hydrogen
in a spherical region of radius $r$ and volume $V = 4\pi r^3/3$.
Assume that $N$ is described by a Poisson pmf

$$p_N(n) = \frac{e^{-\rho V} (\rho V)^n}{n!}, \quad n = 0, 1, 2, \ldots$$

where $\rho$ can be viewed as a limiting density of molecules in space.
Say we choose an arbitrary point in deep space as the center of our
coordinate system. Define a random variable $X$ as the distance from
the origin of our coordinate center to the nearest molecule. Find the
pdf of the random variable $X$, $f_X(x)$.

34. Let $V$ be a random variable with a uniform pdf on $[0, a]$. Let $W$ be
a random variable, independent of $V$, with an exponential pdf with
parameter $\lambda$, that is,

$$f_W(w) = \lambda e^{-\lambda w}; \quad w \in [0, \infty).$$

Let $p(t)$ be the pulse with value 1 when $0 < t < 1$ and 0 otherwise.
Define the random process $\{X(t);\ t \in [0, \infty)\}$ by

$$X(t) = V p(t - W).$$

(This is a model of a square pulse that occurs randomly in time with
a random amplitude.) Find for a fixed time $t > 1$ the cdf
$F_{X(t)}(\alpha) = \Pr(X(t) \le \alpha)$. You must specify the values of the cdf for all possible
real values $\alpha$. Show that there exists a pmf $p$ with a corresponding cdf
$F_1$, a pdf $f$ with a corresponding cdf $F_2$, and a number $\beta_t \in (0, 1)$ such that

$$F_{X(t)}(\alpha) = \beta_t F_2(\alpha) + (1 - \beta_t) F_1(\alpha).$$

Give expressions for $p$, $f$, and $\beta_t$.

35. Prove the following facts about characteristic functions: 

(a) $|M_X(ju)| \le 1$

(b) $M_X(0) = 1$

(c) $|M_X(ju)| \le M_X(0) = 1$

(d) If a random variable $X$ has a characteristic function $M_X(ju)$, if
$c$ is a fixed constant, and if a random variable $Y$ is defined by
$Y = X + c$, then

$$M_Y(ju) = e^{juc} M_X(ju).$$

36. Suppose that $X$ is a random variable described by an exponential pdf

$$f_X(\alpha) = \lambda e^{-\lambda\alpha}; \quad \alpha \ge 0.$$

($\lambda > 0$.) Define a function $q$ which maps nonnegative real numbers
into integers by $q(x)$ = the largest integer less than or equal to $x$. In
other words

$$q(x) = k \ \text{ if } k \le x < k + 1, \quad k = 0, 1, \ldots.$$

(This function is often denoted by $q(x) = \lfloor x \rfloor$.) The function $q$ is a
form of quantizer: it rounds its input downward to the nearest integer
below the input. Define the following two random variables: the
quantizer output

$$Y = q(X)$$

and the quantizer error

$$\epsilon = X - q(X).$$

Note: By construction $\epsilon$ can only take on values in $[0, 1)$.

(a) Find the pmf $p_Y(k)$ for $Y$.

(b) Derive the probability density function for $\epsilon$. (You may find
the “divide and conquer” formula useful here, e.g.,
$P(G) = \sum_i P(G \cap F_i)$, where $\{F_i\}$ are a partition.)



37. Suppose that $(X_1, \ldots, X_N)$ is a random vector described by a product
pdf with uniform marginal pdf’s

$$f_{X_n}(\alpha) = \begin{cases} 1 & |\alpha| < \tfrac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$



Define the following random variables: 
• u = xl

• V = X 2 ) 

• $W = n$ if $n$ is the smallest integer for which $X_n > 1/4$ and
$W = 0$ if there is no such $n$.

(a) Find pdf’s or pmf’s for U, V, and W. 

(b) What is the joint pdf fxi,X 3 ,x^{a, 

38. The joint probability density function of X and Y is 
$f_{X,Y}(\alpha, \beta) = C$, $|\alpha| < 1$, $0 < \beta < 1$.

Define a new random variable 



{U is taken to be 0 if X = 0.) 

(a) Find the constant $C$ and the marginal probability density functions
$f_X(\alpha)$ and $f_Y(\beta)$.

(b) Find the probability density function $f_U(\gamma)$ for $U$.

(c) Suppose that $U$ is quantized into $q(U)$ by defining

$$q(U) = i \ \text{ for } d_{i-1} \le U < d_i; \quad i = 1, 2, 3,$$

where the interval $[d_0, d_3)$ equals the range of possible values of
$U$. Find the quantization levels $d_i$, $i = 0, 1, 2, 3$ such that $q(U)$
has a uniform probability mass function.

39. Let $(X, Y)$ be a random vector described by a product pdf
$f_{XY}(x, y) = f_X(x) f_Y(y)$. Let $F_X$ and $F_Y$ denote the corresponding marginal cdf’s.

(a) Prove

$$\int_{-\infty}^{\infty} F_Y(x) f_X(x)\, dx = 1 - \int_{-\infty}^{\infty} f_Y(x) F_X(x)\, dx.$$

(b) Assume, in addition, that $X$ and $Y$ are identically distributed,
i.e., have the same pdf. Based on the result of (a) calculate the
probability $P(X > Y)$. (Hint: You should be able to derive or
check your answer based on symmetry.)
40. You have 2 coins and a spinning pointer $U$. The coins are fair and
unbiased, and the pointer $U$ has a uniform distribution over $[0, 1)$.
You flip both coins and spin the pointer. A random variable $X$ is
defined as follows:

If the first coin is “heads”, then:

$$X = \begin{cases} 1 & \text{if the 2nd coin is “heads”} \\ 0 & \text{otherwise.} \end{cases}$$

If the first coin is “tails” , then X = U + 2. 

Define another random variable: 

Y — f “heads” 

2U + 1 otherwise 

(a) Find Fx{x). 

(b) Find $\Pr(\tfrac{1}{2} < X < 2\tfrac{1}{2})$.

(c) Sketch the pdf of Y and label important values. 

(d) Design an optimal detection rule to estimate U if you are given 
only Y. What is the probability of error? 

(e) State how to, or explain why it is not possible to: 

i. Generate a binary random variable $Z$, $p_Z(1) = p$, given $U$?

ii. Generate a continuous, uniformly distributed random vari- 
able given Z? 

41. The random vector $\mathbf{W} = (W_0, W_1, W_2)$ is described by the pdf
$f_{\mathbf{W}}(x, y, z) = C|z|$, for $x^2 + y^2 < 1$, $|z| < 1$.

(a) Find C . 

(b) Determine whether the following variables are independent and 
justify your position: 

i. $W_0$ and $W_1$

ii. $W_0$ and $W_2$

iii. $W_1$ and $W_2$

iv. $W_0$ and $W_1$ and $W_2$

(c) Find $\Pr(W_2 > \tfrac{1}{2})$.

(d) Find $F_{W_0, W_2}(0, 0)$.

(e) Find the cdf of the vector $\mathbf{W}$.
(f) Let V = Find Pr(V > 0).

(g) Find the pdf of $M$, where $M = \min(W_0^2 + W_1^2, W_2^2)$.

42. Suppose that X and Y are random variables and that the joint pmf 
is 



$$p_{X,Y}(k, j) = c\, 2^{-k}\, 2^{-(j-k)}; \quad k = 0, 1, 2, \ldots;\ j = k, k + 1, \ldots.$$

(a) Find $c$.

(b) Find the pmf’s $p_X(k)$ and $p_Y(j)$.

(c) Find the conditional pmf’s $p_{X|Y}(k|j)$ and $p_{Y|X}(j|k)$.

(d) Find the probability that Y > 2X. 

43. Suppose that $\mathbf{X} = (X_0, X_1, \ldots, X_{k-1})$ is a random vector ($k$ is some
large number) with joint pdf

$$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} 1 & \text{if } 0 < x_i < 1;\ i = 0, \ldots, k - 1 \\ 0 & \text{else.} \end{cases}$$

Define the random variables $V = X_0 + X_{10}$ and $W = \max(X_0, X_{10})$.
Define the random vector $\mathbf{Y}$:

$$Y_n = 2^n X_n; \quad n = 0, \ldots, k - 1.$$

(a) Find the joint pdf $f_{V,W}(v, w)$.

(b) Find the probabilities $\Pr(W < 1/2)$, $\Pr(V < 1/2)$, and
$\Pr(W < 1/2 \text{ and } V < 1/2)$.

(c) Are W and V independent? 

(d) Find the (joint) pdf for Y. 



44. The random process described in example [3.26] is an example of
a class of processes that is currently somewhat of a fad in scientific
circles: it is chaotic. (See, e.g., Chaos by James Gleick (1987).) Suppose
as in example [3.26] $X_0(\omega) = \omega$ is chosen at random according
to a uniform distribution on $[0, 1)$, that is, the pdf is

$$f_{X_0}(\alpha) = \begin{cases} 1 & \text{if } \alpha \in [0, 1) \\ 0 & \text{else.} \end{cases}$$

As in the example, the remainder of the process is defined recursively
by

$$X_n(\omega) = 2X_{n-1}(\omega) \bmod 1, \quad n = 1, 2, \ldots.$$

Note that if the initial value $X_0$ is known, the remainder of the process
is also known.

Find a nonrecursive expression for $X_n(\omega)$, that is, write $X_n(\omega)$ directly
as a function of $\omega$, e.g., $X_n(\omega) = g(\omega) \bmod 1$.

Find the pdf’s $f_{X_1}(\alpha)$ and $f_{X_n}(\alpha)$.

Hint: after you have found $f_{X_1}$, try induction.

45. Another random process which resembles that of the previous problem
but which is not chaotic is to define $X_0$ in the same way, but define
$X_n$ by

$$X_n(\omega) = (X_{n-1}(\omega) + X_0(\omega)) \bmod 1.$$

Here $X_1$ is equivalent to that of the previous problem, but the subsequent
$X_n$ are different. As in the previous problem, find a direct
formula for $X_n$ in terms of $\omega$ (e.g., $X_n(\omega) = h(\omega) \bmod 1$) and find
the pdf $f_{X_n}(\alpha)$.

46. The Mongol general Subudai is expecting reinforcements from Chenggis
Kahn before attacking King Bela of Hungary. The probability
mass function describing the number $N$ of tumens (units of 10,000
men) that he will receive is

$$p_N(k) = c\,p^k; \quad k = 0, 1, \ldots.$$

If he receives $N = k$ tumens, then his probability of losing the battle
will be $2^{-k}$. This can be described by defining the random variable
$W$ which will be 1 if the battle is won, 0 if the battle is lost, and
defining the conditional probability mass function

$$p_{W|N}(m|k) = \Pr(W = m \mid N = k) = \begin{cases} 2^{-k} & m = 0 \\ 1 - 2^{-k} & m = 1. \end{cases}$$

(a) Find $c$.

(b) Find the (unconditional) pmf $p_W(m)$, that is, what is the probability
that Subudai will win or lose?

(c) Suppose that Subudai is informed that definitely $N < 10$. What
is the new (conditional) pmf for $N$? (That is, find
$\Pr(N = k \mid N < 10)$.)

47. Suppose that $\{X_n;\ n = 0, 1, 2, \ldots\}$ is a binary Bernoulli process, that
is, an iid process with marginal pmf’s

$$p_{X_n}(k) = \begin{cases} p & \text{if } k = 1 \\ 1 - p & \text{if } k = 0 \end{cases}$$

for all $n$. Suppose that $\{W_n;\ n = 0, 1, \ldots\}$ is another binary Bernoulli
process with parameter $\epsilon$, that is,

$$p_{W_n}(k) = \begin{cases} \epsilon & \text{if } k = 1 \\ 1 - \epsilon & \text{if } k = 0. \end{cases}$$



We assume that the two random processes are completely independent
of each other (that is, any collection of samples of $X_n$ is independent
from any collection of $W_n$). We form a new random process
$\{Y_n;\ n = 0, 1, \ldots\}$ by defining

$$Y_n = X_n \oplus W_n,$$

where the $\oplus$ operation denotes mod 2 addition. This setup can be
thought of as taking an input digital signal and sending it across
a binary channel to a receiver. The binary channel can cause an
error between the input and output with probability $\epsilon$. Such
a communication channel is called an additive noise channel because
the output is the input plus an independent noise process (where
“plus” here means mod 2).

(a) Find the output marginal pmf $p_{Y_n}(k)$.

(b) Is $\{Y_n\}$ Bernoulli? That is, is it an iid process?

(c) Find the conditional pmf $p_{Y_n|X_n}(j|k)$.

(d) Find the conditional pmf $p_{X_n|Y_n}(k|j)$.

(e) Find an expression for the probability of error $\Pr(Y_n \ne X_n)$.

(f) Suppose that the receiver is allowed to think about what the
best guess for $X_n$ is given it receives a value $Y_n$. In other words,
if you are told that $Y_n = j$, you can form an estimate or guess
of the input $X_n$ by some function of $j$, say $\hat{X}(j)$. Given this
estimate your new probability of error is given by

$$P_e = \Pr(\hat{X}(Y_n) \ne X_n).$$

What decision rule $\hat{X}(j)$ yields the smallest possible $P_e$? What
is the resulting $P_e$?

48. Suppose that we have a pair of random variables $(X, Y)$ with a mixed
discrete and continuous distribution as follows. $Y$ is a binary $\{0, 1\}$
random variable described by a pmf $p_Y(1) = 0.5$. Conditioned on
$Y = y$, $X$ is continuous with a Gaussian distribution with variance $\sigma^2$
and mean $y$, that is,

$$f_{X|Y}(x|y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - y)^2/(2\sigma^2)}; \quad x \in \Re;\ y = 0, 1.$$

This can be thought of as the result of communicating a binary symbol
(a “bit”) over a noisy channel, which adds zero-mean Gaussian noise
of variance $\sigma^2$ to the bit. In other words, $X = Y + W$, where $W$ is a
Gaussian random variable, independent of $Y$. What is the optimum
(minimum error probability) decision for $Y$ given the observation $X$?
Write an expression for the resulting error probability.

49. Find the multidimensional Gaussian characteristic function of equation
(3.126) by completing the square in the exponent of the defining
multidimensional integral.
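
The inverse cdf construction described in problem 7 can be sketched in code. This is a minimal illustration under stated assumptions, not a worked solution: the exponential cdf $F(x) = 1 - e^{-\lambda x}$ and the four-point uniform pmf are chosen only as convenient test cases, and the function names, the parameter `lam`, and the seed are hypothetical. The discrete case uses the "smallest $x$ with $F(x) \ge r$" definition directly and assumes the values are listed in increasing order.

```python
import bisect
import math
import random

def inverse_cdf_exponential(r, lam=1.0):
    """Inverse of F(x) = 1 - exp(-lam * x), x >= 0, for 0 <= r < 1."""
    return -math.log(1.0 - r) / lam

def make_discrete_inverse_cdf(values, probs):
    """Generalized inverse of a discrete cdf: smallest value x with F(x) >= r.

    values must be sorted in increasing order; probs are the matching pmf values.
    """
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)

    def inverse(r):
        # bisect_left returns the first index whose cumulative probability is >= r.
        i = bisect.bisect_left(cdf, r)
        return values[min(i, len(values) - 1)]

    return inverse

rng = random.Random(0)
uniforms = [rng.random() for _ in range(100_000)]   # samples of U on [0, 1)

exp_samples = [inverse_cdf_exponential(u, lam=2.0) for u in uniforms]
sample_mean = sum(exp_samples) / len(exp_samples)   # should be near 1/lam = 0.5

die = make_discrete_inverse_cdf([1, 2, 3, 4], [0.25, 0.25, 0.25, 0.25])
die_samples = [die(u) for u in uniforms]
```

Comparing a histogram of `exp_samples` or `die_samples` with the target pdf or pmf gives an empirical check on the cdf derived in the problem.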
Chapter 4



Expectation and Averages 



4.1 Averages 

In engineering practice we are often interested in the average behavior of 
measurements on random processes. The goal of this chapter is to link the 
two distinct types of averages that are used — long-term time averages 
taken by calculations on an actual physical realization of a random process 
and averages calculated theoretically by probabilistic averages at some given 
instant of time, averages that are sometimes called expectations. As we 
shall see, both computations often (but by no means always) give the same 
answer. Such results are called laws of large numbers or ergodic theorems. 

At first glance from a conceptual point of view, it seems unlikely that 
long-term time averages and instantaneous probabilistic averages would be 
the same. If we take a long-term time average of a particular realization of 
the random process, say $\{X(t, \omega_0);\ t \in \mathcal{T}\}$, we are averaging for a particular
$\omega$, one which we cannot know or choose; we do not use probability in
any way and we are ignoring what happens with other values of $\omega$. Here
the averages are computed by summing the sequence or integrating the
waveform over $t$ while $\omega_0$ stays fixed. If, on the other hand, we take an
instantaneous probabilistic average, say at the time $t_0$, we are taking a
probabilistic average and summing or integrating over $\omega$ for the random
variable $X(t_0, \omega)$. Thus we have two averages, one along the time axis with
$\omega$ fixed, the other along the $\omega$ axis with time fixed. It seems that there
should be no reason for the answers to agree. Taking a more practical 
point of view, however, it seems that the time and probabilistic averages 
must be the same in many situations. For example, suppose that you 
measure the percentage of time that a particular noise voltage exceeds 10 
volts. If you make the measurement over a sufficiently long period of time,
the result should be a reasonably good estimate of the probability that the
noise voltage exceeds 10 volts at any given instant of time — a probabilistic 
average value. 
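
The noise voltage example can be illustrated with a short simulation. The sketch below rests on assumptions that are not in the text: the noise samples are taken to be iid zero-mean Gaussian with standard deviation 10 volts, and the sample size, seed, and function names are arbitrary. It compares the fraction of time the simulated voltage exceeds 10 volts (a time average) with the Gaussian tail probability at a single instant (a probabilistic average).

```python
import math
import random

def fraction_exceeding(samples, threshold):
    """Time average of the indicator that a sample exceeds the threshold."""
    return sum(1 for v in samples if v > threshold) / len(samples)

def gaussian_tail_probability(threshold, sigma):
    """Pr(X > threshold) for a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * math.erfc(threshold / (sigma * math.sqrt(2.0)))

rng = random.Random(1)
sigma, threshold, n = 10.0, 10.0, 200_000   # illustrative values, not from the text
voltages = [rng.gauss(0.0, sigma) for _ in range(n)]  # assumed iid noise samples

time_average = fraction_exceeding(voltages, threshold)
probability = gaussian_tail_probability(threshold, sigma)
```

Under these assumptions the two numbers agree to within a fraction of a percent; for processes with strong dependence between samples such agreement is not automatic, which is the point of the ergodic theorems mentioned above.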

To proceed further, for simplicity we concentrate on a discrete alphabet 
discrete time random process. Other cases are considered by converting 
appropriate sums into integrals. Let {Xn} be an arbitrary discrete alpha- 
bet discrete time process. Since the process is random, we cannot predict 
accurately its instantaneous or short-term behavior — we can only make 
probabilistic statements. Based on experience with coins, dice, and roulette 
wheels, however, one expects that the long-term average behavior can be 
characterized with more accuracy. For example, if one flips a fair coin, short 
sequences of flips are unpredictable. However, if one flips long enough, one 
would expect to have an average of about 50% of the flips result in heads. 
This is a time average of an instantaneous function of a random process — 
a type of counting function that we will consider extensively. It is obvious
that there are many functions that we can average, for example the average value
and the average power. We will proceed by defining one particular average,
the sample average value of the random process, which is formulated as

$$S_n = n^{-1} \sum_{i=0}^{n-1} X_i; \quad n = 1, 2, 3, \ldots.$$

We will investigate the behavior of $S_n$ for large $n$, i.e., for a long-term time
average. Thus, for example, if the random process $\{X_n\}$ is the coin-flipping
model, the binary process with alphabet $\{0, 1\}$, then $S_n$ is the number of 1’s
divided by the total number of flips, that is, the fraction of flips that produced a
1. As noted before, $S_n$ should be close to 50% for large $n$ if the coin is fair.

Note that, as in example [3.7], for each $n$, $S_n$ is a random variable that
is defined on the same probability space as the random process $\{X_n\}$. This
is made explicit by writing the $\omega$ dependence:

$$S_n(\omega) = \frac{1}{n} \sum_{k=0}^{n-1} X_k(\omega).$$

In more direct analogy to example [3.7], we can consider the $\{X_n\}$ as coordinate
functions on a sequence space, say $(\Re^{\mathcal{T}}, \mathcal{B}(\Re)^{\mathcal{T}}, m)$, where $m$ is
the distribution of the process, in which case $S_n$ is defined directly on the
sequence space. The form of definition is simply a matter of semantics or
convenience. Observe, however, that in any case $\{S_n;\ n = 1, 2, \ldots\}$ is itself
a random process since it is an indexed family of random variables defined
on a probability space.
For the discrete alphabet random process that we are considering, we
can rewrite the sum in another form by grouping together all equal terms:

$$S_n(\omega) = \sum_{a \in A} a\, r_a^{(n)}(\omega), \tag{4.1}$$

where A is the range space of the discrete alphabet random variable 
and r^\uj) = n~^ [number of occurrences of the letter a in {Xi(u;), i = 
0, 1, 2, . . . , n— 1}]. The random variable is called the order relative 
frequency or of the symbol a. Note that for the binary coin flipping example 
we have considered, A = {0, 1}, and S'„(w) = r^\uj), the average number 
of heads in the first n flips. In other words, for the binary coin-flipping 
example, the sample average and the relative frequency of heads are the 
same quantity. More generally, the reader should note that can always 
be written as the sample average of the indicator function for a, la (a:): 



z=0 



where 



la(a;) 



1 if X = a 
0 otherwise. 



Note that l{o} is a more precise, but more clumsy, notation for the indicator 
function of the singleton set {a}. We shall use the shorter form here. 
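
As a small check of the regrouping in (4.1), the following Python sketch computes the sample average of one short sequence directly and via relative frequencies, and also forms a relative frequency as a sample average of an indicator function. The sequence `xs` and all function names are hypothetical choices made for illustration.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(xs):
    """Map each letter a to r_a^(n) = (number of occurrences of a in xs) / n."""
    n = len(xs)
    return {a: Fraction(count, n) for a, count in Counter(xs).items()}


def indicator(a, x):
    """Indicator function 1_a(x): 1 if x == a, 0 otherwise."""
    return 1 if x == a else 0


xs = [0, 1, 1, 0, 1, 2, 2, 1]  # hypothetical sample path X_0, ..., X_{n-1}
n = len(xs)
r = relative_frequencies(xs)

direct = Fraction(sum(xs), n)                      # S_n as a plain average
grouped = sum(a * r_a for a, r_a in r.items())     # S_n via (4.1)
r1_by_indicator = Fraction(sum(indicator(1, x) for x in xs), n)

print(direct, grouped, r[1], r1_by_indicator)
```

Exact rational arithmetic is used so that the two forms of the sample average can be compared for equality rather than to within rounding error.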

Let us now assume that all of the marginal pmf’s of the given process are
the same, say $p_X(x)$, $x \in A$. Based on intuition and gambling experience,
one might suspect that as n goes to infinity, the relative frequency of a
symbol a should go to its probability of occurrence, $p_X(a)$. To continue the
example of binary coin flipping, the relative frequency of heads in n tosses
of a fair coin should tend to 1/2 as $n \to \infty$. If these statements are true,
that is, if in some sense

$$r_a^{(n)} \underset{n\to\infty}{\longrightarrow} p_X(a)\,, \qquad (4.2)$$

then it follows that in a similar sense

$$S_n \underset{n\to\infty}{\longrightarrow} \sum_{a \in A} a\,p_X(a)\,, \qquad (4.3)$$



the same expression as (4.1) with the relative frequency replaced by the 
pmf. The formula on the right is an example of an expectation of a random 
variable, a weighted average with respect to a probability measure. The 
formula should be recognized as a special case of the definition of expectation
of (2.34), where the pmf is $p_X$ and $g(x) = x$, the identity function. The
previous plausibility argument motivates studying such weighted averages 
because they will characterize the limiting behavior of time averages in the 
same way that probabilities characterize the limiting behavior of relative 
frequencies. 

Limiting statements of the form of (4.2) and (4.3) are called laws of 
large numbers or ergodic theorems. They relate long-run sample averages 
or time average behavior to probabilistic calculations made at any given 
instant of time. It is obvious that such laws or theorems do not always 
hold. If the coin we are flipping wears in a known fashion with time so that 
the probability of a head changes, then one could hardly expect that the 
relative frequency of heads would equal the probability of heads at time 
zero. 

In order to make precise statements and to develop conditions under
which the laws or theorems do hold, we first need to develop the properties
of the quantity on the right-hand side of (4.2) and (4.3). In particular, we
cannot at this point make any sense out of a statement like
“$\lim_{n\to\infty} S_n = \sum_{a \in A} a\,p_X(a)$,” since we have no definition for such a limit of random variables
or functions of random variables. It is obvious, however, that the usual
definition of a limit used in calculus will not do, because $S_n$ is a random
variable, albeit a random variable whose “randomness” decreases in some
sense with increasing n. Thus the limit must be defined in some fashion
that involves probability. Such limits are deferred to a later section and we
begin by looking at the definitions and calculus of expectations.



4.2 Expectation 

Given a discrete alphabet random variable X specified by a pmf $p_X$, define
the expected value, probabilistic average, or mean of X by

$$E(X) = \sum_{x \in A} x\,p_X(x)\,. \qquad (4.4)$$

The expectation is also denoted by EX or E[X] or by an overbar, as
$\bar{X}$. The expectation is also sometimes called an ensemble average to denote
averaging across the ensemble of sequences that is generated for different
values of $\omega$ at a given instant of time.

The astute reader might note that we have really provided two definitions
of the expectation of X. The definition of (4.4) has already been
noted to be a special case of (2.34) with pmf $p_X$ and function $g(x) = x$.
Alternatively, we could use (2.34) in a more fundamental form and consider
$g(\omega) = X(\omega)$ as a function defined on an underlying probability space
described by a pmf p or a pdf f, in which case (2.34) or (2.57) provides a different
formula for finding the expectation in terms of the original probability
function:

$$E(X) = \sum_{\omega} X(\omega)\,p(\omega) \qquad (4.5)$$

if the original space is discrete, or

$$E(X) = \int X(r)\,f(r)\,dr \qquad (4.6)$$

if it is described by a pdf. Are these two versions consistent? The answer 
is yes, as will be proved soon by the fundamental theorem of expectation. 
The equivalence of these forms is essentially a change of variables formula. 
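
The consistency of (4.4) and (4.5) can be checked directly on a small discrete space. In the Python sketch below, the underlying space (a fair six-sided die) and the random variable X(ω) = ω mod 3 are hypothetical choices made only for illustration; the mean is computed once from the underlying pmf and once from the derived pmf of X.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical underlying probability space: a fair die with outcomes 1..6.
omega_pmf = {w: Fraction(1, 6) for w in range(1, 7)}


def X(w):
    """Hypothetical random variable defined on the underlying space."""
    return w % 3


# (4.5): average X(omega) against the underlying pmf p(omega).
mean_underlying = sum(X(w) * p for w, p in omega_pmf.items())

# Derived distribution: p_X(x) = sum of p(omega) over all omega with X(omega) = x.
p_X = defaultdict(Fraction)
for w, p in omega_pmf.items():
    p_X[X(w)] += p

# (4.4): average x against the derived pmf p_X(x).
mean_derived = sum(x * p for x, p in p_X.items())

print(mean_underlying, mean_derived)
```

Both computations give the same value because the second simply groups the terms of the first according to the value taken by X, which is the change of variables mentioned above.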

The mean of a random variable is a weighted average of the possible
values of the random variable with the pmf used as a weighting. Before
continuing, observe that we can define an analogous quantity for a continuous
random variable possessing a pdf: If the random variable X is described
by a pdf $f_X$, then we define the expectation of X by

$$EX = \int x f_X(x)\,dx\,, \qquad (4.7)$$

where we have replaced the sum by an integral. Analogous to the discrete
case, this formula is a special case of (2.57) with pdf $f = f_X$ and g being
the identity function. We can also use (2.57) to express the expectation in
terms of an underlying pdf, say f, with $g = X$ by the formula

$$EX = \int X(r)\,f(r)\,dr\,. \qquad (4.8)$$

The equivalence of these two formulas will be considered when the funda- 
mental theorem of expectation is treated. 

While the integral does not have the intuitive motivation involving a 
relative frequency converging to a pmf that the earlier sum did, we shall 
see that it plays the analogous role in the laws of large numbers. Roughly 
speaking, this is because continuous random variables can be approximated 
by discrete random variables arbitrarily closely by very fine quantization. 
Through this procedure, the integrals with pdfs are approximated by sums 
with pmf’s and the discrete alphabet results imply the continuous alphabet 
results by taking appropriate limits. Because of the direct analogy, we 
shall develop the properties of expectations for continuous random variables 
along with those for discrete alphabet random variables. Note in passing 
that, analogous to using the Stieltjes integral as a unified notation for sums
and integrals when computing probabilities, the same thing can be done
for expectations. If $F_X$ is the cdf of a random variable X, define

$$EX = \int x\,dF_X(x) = \begin{cases} \sum_x x\,p_X(x) & \text{if } X \text{ is discrete} \\ \int x f_X(x)\,dx & \text{if } X \text{ has a pdf.} \end{cases}$$

In a similar manner, we can define the expectation of a mixture random 
variable having both continuous and discrete parts in a manner analogous 
to (3.36). 

4.2.1 Examples: Expectation 

The following examples provide some typical expectation computations. 

[4.1] As a slight generalization of the fair coin flip, consider the more general
binary pmf with parameter p; that is, $p_X(1) = p$ and $p_X(0) = 1 - p$.
In this case

$$EX = \sum_{x=0}^{1} x\,p_X(x) = 0(1 - p) + 1p = p\,.$$

It is interesting to note that in this example, as is generally true for
discrete random variables, EX is not necessarily in the alphabet of
the random variable, i.e., $EX \ne 0$ or 1 unless p = 0 or 1.

[4.2] A more complicated discrete example is a geometric random variable.
In this case

$$EX = \sum_{k=1}^{\infty} k\,p_X(k) = \sum_{k=1}^{\infty} k\,p(1 - p)^{k-1}\,,$$

a sum evaluated in (2.48) as 1/p.
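
The value 1/p for the geometric mean can be checked numerically by truncating the infinite sum. The Python sketch below is only an illustration; the function name and the choice p = 0.25 are assumptions.

```python
def geometric_mean_partial_sum(p, terms):
    """Partial sum of k * p * (1 - p)**(k - 1) for k = 1, ..., terms."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))


p = 0.25  # illustrative parameter; the limiting value of the sum is 1/p = 4
for terms in (5, 20, 100, 500):
    print(terms, geometric_mean_partial_sum(p, terms))
```

Because every term is nonnegative, the partial sums increase with the number of terms and approach 1/p from below.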

[4.3] As an example of a continuous random variable, assume that X is a
uniform random variable on [0, 1], that is, that its density is one on
[0, 1]. Here

$$EX = \int_0^1 x f_X(x)\,dx = \int_0^1 x\,dx = 1/2\,,$$

an integral evaluated in (2.67).

[4.4] If X is an exponentially distributed random variable with parameter
$\lambda$, then from (2.71)

$$EX = \int_0^\infty r\,\lambda e^{-\lambda r}\,dr = \frac{1}{\lambda}\,. \qquad (4.9)$$



In some cases expectations can be found virtually by inspection. For
example, if X has an even pdf $f_X$, that is, if $f_X(-x) = f_X(x)$ for all
$x \in \Re$, then if the integral exists, EX = 0, since $x f_X(x)$ is an odd function
and hence has a zero integral. The assumption that the integral exists
is necessary because the odd function $x f_X(x)$ need not be integrable. For example,
suppose that we have a pdf $f_X(x) = c/x^2$ for all $|x| > 1$, where c is a
normalization constant. Then it is not true that EX is zero, even though
the pdf is even, because the Riemann integral

$$\int_{x:\,|x|>1} \frac{c}{x}\,dx$$

does not exist. (The puzzled reader should review the definition of improper
integrals. Their existence requires that the limit

$$\lim_{T\to\infty} \lim_{S\to\infty} \int_{-T}^{S} x f_X(x)\,dx$$

exists regardless of how T and S tend to infinity; in particular, the existence
of the limit with the constraint T = S is not sufficient for the existence of
the integral. These limits do not exist for the given example because 1/x
is not integrable on $[1, \infty)$.) Nonetheless, it is convenient to set EX to 0
in this example because of the obvious intuitive interpretation.
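
The dependence on how T and S tend to infinity can be made concrete. Assuming the pdf is zero for $|x| \le 1$, normalization gives c = 1/2, and for T, S > 1 the truncated integral of $x f_X(x)$ over $[-T, S]$ equals $c(\ln S - \ln T)$. The Python sketch below evaluates this closed form along two different paths to infinity; the function name and the particular paths are illustrative assumptions.

```python
import math

C = 0.5  # normalization, assuming f_X(x) = C / x**2 for |x| > 1 and 0 otherwise


def truncated_mean(T, S):
    """Integral of x * f_X(x) over [-T, S], valid for T > 1 and S > 1.

    The positive part contributes C * ln(S), the negative part -C * ln(T).
    """
    return C * (math.log(S) - math.log(T))


for T in (1e2, 1e4, 1e6):
    symmetric = truncated_mean(T, T)       # constraint S = T
    lopsided = truncated_mean(T, 2 * T)    # constraint S = 2T
    print(T, symmetric, lopsided)
```

With S = T the truncated integral is always 0, while with S = 2T it is always $(\ln 2)/2$, so no single limit exists independently of the path.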

Sometimes the pdf is an even function about some nonzero value, that
is, $f_X(m + x) = f_X(m - x)$, where m is some constant. In this case,
it is easily seen that if the expectation exists, then EX = m, as the
reader can quickly verify by a change of variable in the integral defining
the expectation. The most important example of this is the Gaussian pdf,
which is even about the constant m.

The same conclusions also obviously hold for an even pmf. 
Expectations of Functions of Random Variables

In addition to
the expectation of a given random variable, we will often be interested in
the expectations of other random variables formed as functions of the given
one. In the beginning of the chapter we introduced the relative frequency
function, $r_a^{(n)}$, which counts the relative number of occurrences of the value
a in a sequence of n terms. We are interested in its expected value and in the
expected value of the indicator function that appears in the expression for
$r_a^{(n)}$. More generally, given a random variable X and a function $g : \Re \to \Re$,
we might wish to find the expectation of the random variable Y = g(X).
If X corresponds to a voltage measurement and g is a simple squaring
operation, $g(X) = X^2$, then g(X) provides the instantaneous energy across
a unit resistor. Its expected value, then, represents the probabilistic average
energy. More generally than the square of a random variable, the moments
of a random variable X are defined by $E[X^k]$ for k = 1, 2, .... The mean is
the first moment, the mean square is the second moment, and so on. Moments are
often useful as general parameters of a distribution, providing information
on its shape without requiring the complete pdf or pmf. Some distributions
are completely characterized by a few moments. It is often useful to consider
moments of a “centralized” random variable formed by removing its mean.
The kth centralized moment is defined by $E[(X - E(X))^k]$. Of particular
interest is the second centralized moment or variance $\sigma_X^2 = E[(X - E(X))^2]$.
Other functions that are of interest are indicator functions of a set, $1_F(x) = 1$
if $x \in F$ and 0 otherwise, so that $1_F(X)$ is a binary random variable
indicating whether or not the value of X lies in F, and complex exponentials
$e^{jux}$.

Expectations of functions of random variables were defined in this chap- 
ter in terms of the derived distribution for the new random variable. In 
chapter 2, however, they were defined in terms of the original pmf or pdf in 
the underlying probability space, a formula not requiring that the new dis- 
tribution be derived. We next show that the two formulas are consistent. 
First consider finding the expectation of Y by using derived distribution 
techniques to find the probability function for Y and then use the defini- 
tion of expectation to evaluate EY. Specifically, if X is discrete, the pmf 
for Y is found as before as 

$$p_Y(y) = \sum_{x:\,g(x)=y} p_X(x)\,,\quad y \in A_Y\,.$$

EY is then found as

$$EY = \sum_{A_Y} y\,p_Y(y)\,.$$

Although it is straightforward to find the probability function for Y, it
can be a nuisance if it is being found only as a step in the evaluation of
the expectation EY = Eg(X). A second and easier method of finding
EY is normally used. Looking at the formula for EX, it seems intuitively
obvious that E(g(X)) should result if x is replaced by g(x). This can be
proved by the following simple procedure. Starting with the pmf for Y,
then substituting for its expression in terms of the pmf of X and reordering
the summation, the expectation of Y is found directly from the pmf for X
as claimed:

$$\begin{aligned}
EY &= \sum_{A_Y} y\,p_Y(y) \\
&= \sum_{A_Y} y \Bigl( \sum_{x:\,g(x)=y} p_X(x) \Bigr) \\
&= \sum_{A_Y} \Bigl( \sum_{x:\,g(x)=y} g(x)\,p_X(x) \Bigr) \\
&= \sum_{A_X} g(x)\,p_X(x)\,.
\end{aligned}$$



This little bit of manipulation is given the fancy name of the fundamen- 
tal theorem of expectation. It is a very useful formula in that it allows 
the computation of expectations of functions of random variables without 
the necessity of performing the (usually more difficult) derived distribution 
operations. 
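
The two routes to EY can be compared on a small discrete example. In the Python sketch below, the pmf `p_X` and the function `g` are hypothetical choices; the first computation derives the pmf of Y = g(X) and then averages, while the second applies the fundamental theorem of expectation directly.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical pmf for X on the alphabet {-1, 0, 1, 2}.
p_X = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4)}


def g(x):
    """Hypothetical function of X; here a simple squaring operation."""
    return x * x


# Route 1: derive p_Y(y) = sum of p_X(x) over x with g(x) = y, then average y.
p_Y = defaultdict(Fraction)
for x, p in p_X.items():
    p_Y[g(x)] += p
ey_derived = sum(y * p for y, p in p_Y.items())

# Route 2: fundamental theorem of expectation, averaging g(x) against p_X.
ey_direct = sum(g(x) * p for x, p in p_X.items())

print(dict(p_Y), ey_derived, ey_direct)
```

Both routes give 3/2 here; the second avoids constructing the derived pmf altogether.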

A similar proof holds for the case of a discrete random variable defined 
on a continuous probability space described by a pdf. The proof is left as 
an exercise (problem 4.3). 

A similar change of variables argument with integrals in place of sums 
yields the analogous pdf result for continuous random variables. As is 
customary, however, we have only provided the proof for the simple discrete 
case. For the details of the continuous case, we refer the reader to books 
on integration or analysis. The reader should be aware that such integral 
results will have additional technical assumptions (almost always satisfied) 
required to guarantee the existence of the various integrals. We summarize 
the results below. 

Theorem 4.1 The Fundamental Theorem of Expectation.

Let a random variable X be described by a cdf $F_X$, which is in turn
described by either a pmf $p_X$ or a pdf $f_X$. Given any measurable function
$g : \Re \to \Re$, the resulting random variable Y = g(X) has expectation

$$EY = E(g(X)) = \int y\,dF_{g(X)}(y) = \int g(x)\,dF_X(x) = \begin{cases} \sum_x g(x)\,p_X(x) \\ \text{or} \\ \int g(x) f_X(x)\,dx\,. \end{cases}$$
The qualification “measurable” is needed in the theorem to guarantee
the existence of the expectation. Measurability is satisfied by almost any 
function that you can think of and, for all practical purposes, can be ne- 
glected. 

As a simple example of the use of this formula, consider a random
variable X with a uniform pdf on [−1/2, 1/2]. Define the random variable
$Y = X^2$, that is, $g(r) = r^2$. We can use the derived distribution formula
(3.40) to write

$$f_Y(y) = y^{-1/2} f_X(y^{1/2})\,;\quad y > 0\,,$$

and hence

$$f_Y(y) = y^{-1/2}\,;\quad y \in (0, 1/4]\,,$$

where we have used the fact that $f_X(y^{1/2})$ is 1 only if the nonnegative
argument is less than 1/2 or y < 1/4. We can then find EY as

$$EY = \int y f_Y(y)\,dy = \int_0^{1/4} y^{1/2}\,dy = \frac{(1/4)^{3/2}}{3/2} = \frac{1}{12}\,.$$

Alternatively, we can use the theorem to write

$$EY = E(X^2) = \int_{-1/2}^{1/2} x^2\,dx = 2\,\frac{(1/2)^3}{3} = \frac{1}{12}\,.$$

Note that the result is the same for each method. However, the second 
calculation is much simpler, especially if one considers the work which has 
already been done in chapter 3 in deriving the density formula for the square 
of a random variable. 
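
Both integrals in this example can be checked numerically. The Python sketch below uses a simple midpoint rule, which is our own choice of quadrature and not a method from the text; the function name and step count are assumptions. The first integrand is $y f_Y(y) = y \cdot y^{-1/2}$ on (0, 1/4] and the second is $x^2 f_X(x) = x^2$ on [−1/2, 1/2].

```python
def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# Derived-distribution route: integrate y * f_Y(y) with f_Y(y) = y**-0.5 on (0, 1/4].
ey_from_fy = midpoint_integral(lambda y: y * y ** -0.5, 0.0, 0.25)

# Fundamental-theorem route: integrate g(x) * f_X(x) = x**2 on [-1/2, 1/2].
ey_from_fx = midpoint_integral(lambda x: x * x, -0.5, 0.5)

print(ey_from_fy, ey_from_fx, 1 / 12)
```

The midpoint rule never evaluates the integrand at y = 0, so the singularity of $f_Y$ at the origin causes no difficulty in this sketch.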

[4.5] A second example generalizes an observation of chapter 2 and shows
that expectations can be used to express probabilities (and hence that
probabilities can be considered as special cases of expectation). Recall
that the indicator function of an event F is defined by

$$1_F(x) = \begin{cases} 1 & \text{if } x \in F \\ 0 & \text{otherwise.} \end{cases}$$

The probability of the event F can be written in the following form
which is convenient in certain computations:

$$E\,1_F(X) = \int 1_F(x)\,dF_X(x) = \int_F dF_X(x) = P_X(F)\,, \qquad (4.10)$$

where we have used the universal Stieltjes integral representation of
(3.32) to save writing out both sums of pmf’s and integrals of pdf’s
(the reader who is unconvinced by (4.10) should write out the specific
pmf and pdf forms). Observe also that finding probability by taking
expectations of indicator functions is like finding a relative frequency
by taking a sample average of an indicator function.

It is obvious from the fundamental theorem of expectation that the 
expected value of any function of a random variable can be calculated from
its probability distribution. The preceding example demonstrates that the 
converse is also true: The probability distribution can be calculated from 
a knowledge of the expectation of a large enough set of functions of the 
random variable. The example provides the result for the set of all indicator 
functions. The choice is not unique, as shown by the following example: 

[4.6] Let g(x) be the complex function $e^{jux}$ where u is an arbitrary constant.
For a cdf $F_X$, define

$$E(g(X)) = E(e^{juX}) = \int e^{jux}\,dF_X(x)\,.$$

This expectation is immediately recognizable as the characteristic
function of the random variable (or its distribution), providing a
shorthand definition

$$M_X(ju) = E[e^{juX}]\,.$$



In addition to its use in deriving distributions for sums of independent 
random variables, the characteristic function can be used to compute mo- 
ments of a random variable (as the Fourier transform can be used to find 
moments of a signal). For example, consider the discrete case and take a
derivative of the characteristic function $M_X(ju)$ with respect to u:

$$\frac{d}{du} M_X(ju) = \frac{d}{du} \sum_x p_X(x)\,e^{jux} = \sum_x p_X(x)\,(jx)\,e^{jux}$$

and evaluate the derivative at u = 0 to find that

$$M_X'(0) = \frac{d}{du} M_X(ju)\Big|_{u=0} = jEX\,.$$

Thus the mean of a random variable can be found by differentiating the
characteristic function and setting the argument to 0 as

$$EX = \frac{M_X'(0)}{j}\,. \qquad (4.11)$$
Repeated differentiation can be used to show more generally that the
kth moment can be found as

$$E[X^k] = j^{-k} M_X^{(k)}(0) = j^{-k} \frac{d^k}{du^k} M_X(ju)\Big|_{u=0}\,. \qquad (4.12)$$

If one needs several moments of a given random variable, it is usually easier
to do one integration to find the characteristic function and then several
differentiations than it is to do the several integrations necessary to find
the moments directly. Note that if we make the substitution w = ju and
differentiate with respect to w, instead of u,

$$\frac{d^k}{dw^k} M_X(w)\Big|_{w=0} = E(X^k)\,.$$

Because of this property, characteristic functions with ju = w are called
moment-generating functions. From the defining sum or integral for characteristic
functions in example [4.6], the moment-generating function may
not exist for all w = v + ju, even when it exists for all w = ju with u real.
This is a variation on the idea that a Laplace transform might not exist for
all complex frequencies $s = \sigma + j\omega$ even when it exists for all $s = j\omega$
with $\omega$ real, that is, the Fourier transform exists.

Example [4.6] illustrates an obvious extension of the fundamental theorem
of expectation. In [4.6] the complex function is actually a vector
function of length 2. Thus it is seen that the theorem is valid for vector
functions, $\mathbf{g}(x)$, as well as for scalar functions, g(x). The expectation of a
vector is simply the vector of expected values of the components.

As a simple example, recall from (3.112) that the characteristic function
of a binary random variable X with parameter $p = p_X(1) = 1 - p_X(0)$ is

$$M_X(ju) = (1 - p) + p e^{ju}\,. \qquad (4.13)$$

It is easily seen that

$$\frac{M_X'(0)}{j} = p = E[X]\,,\quad -M_X^{(2)}(0) = p = E[X^2]\,.$$

As another example, consider $\mathcal{N}(m, \sigma^2)$, the Gaussian pdf with mean m
and variance $\sigma^2$. Differentiating easily yields

$$\frac{M_X'(0)}{j} = m = E[X]\,,\quad -M_X^{(2)}(0) = \sigma^2 + m^2 = E[X^2]\,.$$
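
The moment formulas for the binary case can be checked numerically by replacing the derivatives of the characteristic function with central finite differences. This is a rough sketch under stated assumptions: the finite-difference scheme, the step size h, and all function names are our own illustrative choices, and the results are approximations rather than exact derivatives.

```python
import cmath


def char_fn_binary(u, p):
    """Characteristic function (1 - p) + p * exp(ju) of a binary random variable."""
    return (1 - p) + p * cmath.exp(1j * u)


def moments_from_char_fn(M, h=1e-3):
    """Estimate E[X] and E[X^2] from a characteristic function u -> M(u).

    Uses central differences for the first and second derivatives at u = 0,
    then applies E[X] = M'(0) / j and E[X^2] = -M''(0).
    """
    first = (M(h) - M(-h)) / (2 * h)
    second = (M(h) - 2 * M(0.0) + M(-h)) / (h * h)
    return (first / 1j).real, (-second).real


p = 0.3  # illustrative parameter
mean, second_moment = moments_from_char_fn(lambda u: char_fn_binary(u, p))
print(mean, second_moment)
```

Both printed values should be close to p, in agreement with E[X] = E[X^2] = p for a binary random variable with alphabet {0, 1}.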



The relationship between the characteristic function of a distribution
and the moments of a distribution becomes particularly striking when the
characteristic function is sufficiently nice near the origin to possess a Taylor
series expansion. The Taylor series of a function f(u) about the point u = 0
has the form

$$f(u) = \sum_{k=0}^{\infty} u^k \frac{f^{(k)}(0)}{k!} = f(0) + u f^{(1)}(0) + u^2 \frac{f^{(2)}(0)}{2} + \text{terms in } u^k;\ k \ge 3\,, \qquad (4.14)$$

where the derivatives

$$f^{(k)}(0) = \frac{d^k}{du^k} f(u)\Big|_{u=0}$$

are assumed to exist, that is, the function is assumed to be analytic at the
origin. Combining the Taylor series expansion with the moment-generating
property (4.12) yields

$$\begin{aligned}
M_X(ju) &= \sum_{k=0}^{\infty} u^k \frac{M_X^{(k)}(0)}{k!} \\
&= \sum_{k=0}^{\infty} (ju)^k \frac{E(X^k)}{k!} \\
&= 1 + juE(X) - \frac{u^2 E(X^2)}{2} + o(u^2)\,.
\end{aligned} \qquad (4.15)$$



This result has an interesting implication: knowing all of the moments 
of the random variable is equivalent to knowing the behavior of the charac- 
teristic function near the origin. If the characteristic function is sufficiently 
well behaved for the Taylor series to be valid over the entire range of u 
rather than just in the area around 0, then knowing all of the moments of 
a random variable is sufficient to know the transform. Since the transform 
in turn implies the distribution, this guarantees that knowing all of the 
moments of a random variable completely describes the distribution. This 
is true, however, only when the distribution is sufficiently “nice,” that is, 
when the technical conditions ensuring the existence of all of the required 
derivatives and of the convergence of the Taylor series hold. 

The approximation of (4.15) plays an important role in the central limit
theorem, so it is worth pointing out that it holds under even more general
conditions than having an analytic function. In particular, if X has a second
moment so that $E[X^2] < \infty$, then

$$M_X(ju) = 1 + juE(X) - \frac{u^2 E(X^2)}{2} + o(u^2)\,, \qquad (4.16)$$

where $o(u^2)$ contains higher order terms that go to zero as $u \to 0$ faster
than $u^2$. See, for example, Breiman’s treatment of characteristic functions
[6].
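
The meaning of the $o(u^2)$ term can be illustrated for the binary characteristic function $(1 - p) + p e^{ju}$, for which E(X) = E(X^2) = p. The Python sketch below computes the error of the second-order approximation divided by $u^2$ for decreasing u; the function name and the value of p are illustrative assumptions.

```python
import cmath


def taylor_remainder_ratio(u, p):
    """Return |M_X(ju) - (1 + ju E[X] - u^2 E[X^2] / 2)| / u^2 for a binary X.

    For a binary random variable with parameter p, E[X] = E[X^2] = p.
    """
    exact = (1 - p) + p * cmath.exp(1j * u)
    approx = 1 + 1j * u * p - u * u * p / 2
    return abs(exact - approx) / (u * u)


p = 0.3  # illustrative parameter
for u in (1.0, 0.1, 0.01, 0.001):
    print(u, taylor_remainder_ratio(u, p))
```

The ratio shrinks roughly in proportion to u, which is what it means for the remainder to go to zero faster than $u^2$.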

The most important application of the characteristic function is its use
in deriving properties of sums of independent random variables, as was
seen in (3.111).



4.3 Functions of Several Random Variables 



Thus far expectations have been considered for functions of a single random 
variable, but it will often be necessary to treat functions of multiple random 
variables such as sums, products, maxima, and minima. For example, given 
random variables U and V defined on a common probability space we might 
wish to find the expectation of Y = g(U, V). The fundamental theorem of
expectation has a natural extension (which is proved in the same way). 



Theorem 4.2 Fundamental Theorem of Expectation for Functions of Several
Random Variables

Given random variables $X_0, X_1, \ldots, X_{k-1}$ described by a cdf $F_{X_0, X_1, \ldots, X_{k-1}}$
and given a measurable function $g : \Re^k \to \Re$,

$$\begin{aligned}
E[g(X_0, \ldots, X_{k-1})] &= \int g(x_0, \ldots, x_{k-1})\,dF_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) \\
&= \begin{cases}
\sum_{x_0, \ldots, x_{k-1}} g(x_0, \ldots, x_{k-1})\,p_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1}) \\
\text{or} \\
\int g(x_0, \ldots, x_{k-1})\,f_{X_0, \ldots, X_{k-1}}(x_0, \ldots, x_{k-1})\,dx_0 \cdots dx_{k-1}\,.
\end{cases}
\end{aligned}$$



As examples of expectation of several random variables we will consider 
correlation, covariance, multidimensional characteristic functions, and dif- 
ferential entropy. First, however, we develop some simple and important 
properties of expectation that will be needed. 



4.4 Properties of Expectation 

Expectation possesses several basic properties that will prove useful. We 
now present these properties and prove them for the discrete case. The 
continuous results follow by using integrals in place of sums. 
Property 1. If X is a random variable such that $\Pr(X \ge 0) = 1$, then
$EX \ge 0$.

Proof. $\Pr(X \ge 0) = 1$ implies that the pmf $p_X(x) = 0$ for x < 0. If
$p_X(x)$ is nonzero only for nonnegative x, then the sum defining the expectation
contains only terms $x\,p_X(x) \ge 0$, and hence the sum and EX are
nonnegative. Note that property 1 parallels Axiom 2.1 of probability. That
is, the nonnegativity of probability measures implies property 1.



Property 2. If X is a random variable such that for some fixed number
r, Pr(X = r) = 1, then EX = r. Thus the expectation of a constant equals
the constant.

Proof. Pr(X = r) = 1 implies that $p_X(r) = 1$. Thus the result follows
from the definition of expectation. Observe that property 2 parallels Axiom
2.2 of probability. That is, the normalization of the total probability to 1
leaves the constant unscaled in the result. If total probability were different
from 1, the expectation of a constant as defined would be a different, scaled
value of the constant.



Property 3. Expectation is linear; that is, given two random variables
X and Y and two real constants a and b,

$$E(aX + bY) = aEX + bEY\,.$$

Proof. For simplicity we focus on the discrete case; the proof for pdf’s
is the obvious analog. Let g(x, y) = ax + by, where a and b are constants.
In this case the fundamental theorem of expectation for functions of several
(here two) random variables implies that

$$\begin{aligned}
E[aX + bY] &= \sum_{x,y} (ax + by)\,p_{X,Y}(x, y) \\
&= a \sum_x x \sum_y p_{X,Y}(x, y) + b \sum_y y \sum_x p_{X,Y}(x, y)\,.
\end{aligned}$$

Using the consistency of marginal and joint pmf’s of (3.13)-(3.14) this
becomes

$$\begin{aligned}
E[aX + bY] &= a \sum_x x\,p_X(x) + b \sum_y y\,p_Y(y) \\
&= aE(X) + bE(Y)\,.
\end{aligned} \qquad (4.17)$$
Keep in mind that this result has nothing to do with whether or not the
random variables are independent.
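
To emphasize that linearity does not require independence, the following Python sketch evaluates both sides of property 3 for a pair of dependent random variables. The joint pmf, the constants a and b, and the helper `expect` are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf of dependent random variables X and Y:
# Y is 0 whenever X is 0, and Y is 1 or 2 whenever X is 1.
joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4)}


def expect(g, joint_pmf):
    """E[g(X, Y)] computed as the sum of g(x, y) * p_{X,Y}(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())


a, b = 3, -2
lhs = expect(lambda x, y: a * x + b * y, joint)                       # E[aX + bY]
rhs = a * expect(lambda x, y: x, joint) + b * expect(lambda x, y: y, joint)

print(lhs, rhs)
```

The two sides agree exactly even though X and Y are clearly dependent, since only the marginal means enter the right-hand side.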

The linearity of expectation follows from the additivity of probability. 
That is, the summing out of joint pmf’s to get marginal pmf’s in the proof
was a direct consequence of Axiom 2.4. The alert reader will likely have
noticed the method behind the presentation of the properties of expecta- 
tion: each follows directly from the corresponding axiom of probability. 
Furthermore, using (4.10), the converse is true: That is, instead of starting 
with the axioms of probability, suppose we start by using the properties of 
expectation as the axioms of expectation. Then the axioms of probability 
become the derived properties of probability. Thus the first three axioms of 
probability and the first three properties of expectation are dual; one can 
start with either and get the other. One might suspect that to get a useful 
theory based on expectation, one would require a property analogous to 
Axiom 2.4 of probability, that is, a limiting form of expectation property 3. 
This is, in fact, the case, and the fourth basic property of expectation is the 
countably infinite version of property 3. When dealing with expectations, 
however, the fourth property is more often stated as a continuity property, 
that is, in a form analogous to Axiom 2.4 of probability given in equation 
(2.28). For reference we state the property below without proof: 



Property 4. Given an increasing sequence of nonnegative random
variables $X_n$; n = 0, 1, 2, ..., that is, $X_n \ge X_{n-1}$ for all n (i.e., $X_n(\omega) \ge X_{n-1}(\omega)$
for all $\omega \in \Omega$), which converge to a limiting random variable
$X = \lim_{n\to\infty} X_n$, then

$$E\Bigl(\lim_{n\to\infty} X_n\Bigr) = \lim_{n\to\infty} EX_n\,.$$



Thus as with probabilities, one can in certain cases exchange the order 
of limits and expectation. The cases include but are not limited to those 
of property 4. Property 4 is called the monotone convergence theorem and 
is one of the basic properties of integration as well as expectation. This 
theorem is discussed in appendix B along with another important limiting 
result, the dominated convergence theorem. 

In fact, the four properties of expectation can be taken as a definition 
of an integral (viz., the Stieltjes integral) and used to develop the general 
Lebesgue theory of integration. That is, the theory of expectation is really 
just a specialization of the theory of integration. The duality between 
probability and expectation is just a special case of the duality between 
measure theory and the theory of integration. 
4.5 Examples: Functions of Several Random Variables

4.5.1 Correlation 

We next introduce the idea of correlation or expectation of products of random
variables that will lead to the development of a property of expectation
that is special to independent random variables. A weak form of this property
will be seen to provide a weak form of independence that will later be
useful in characterizing certain random processes. Correlations will later
be seen to play a fundamental role in many signal processing applications.
Suppose we have two independent random variables X and Y and we have
two functions or measurements on these random variables, say g(X) and
h(Y), where $g : \Re \to \Re$, $h : \Re \to \Re$, and E[g(X)] and E[h(Y)] exist
and are finite. Consider the expected value of the product of these two
functions, called the correlation between g(X) and h(Y). Applying the
two-dimensional vector case of the fundamental theorem of expectation to
discrete random variables results in

$$\begin{aligned}
E(g(X)h(Y)) &= \sum_{x,y} g(x)h(y)\,p_{X,Y}(x, y) \\
&= \sum_x \sum_y g(x)h(y)\,p_X(x)\,p_Y(y) \\
&= \Bigl( \sum_x g(x)\,p_X(x) \Bigr) \Bigl( \sum_y h(y)\,p_Y(y) \Bigr) \\
&= (E(g(X)))(E(h(Y)))\,.
\end{aligned}$$

A similar manipulation with integrals shows the same to be true for random 
variables possessing pdf’s. Thus we have proved the following result, which 
we state formally as a lemma. 

Lemma 4.1 For any two independent random variables X and Y,

$$E(g(X)h(Y)) = (Eg(X))(Eh(Y)) \qquad (4.18)$$

for all functions g and h with finite expectation.

By stating that the functions have finite expectation we implicitly as- 
sume them to be measurable, i.e., to have a distribution with respect to 
which we can evaluate an expectation. Measurability is satisfied by al- 
most all functions so that the qualification can be ignored for all practical 
purposes. 
To cite the most important example, if g and h are identity functions
(h(r) = g(r) = r), then we have that independence of X and Y implies
that

$$E(XY) = (EX)(EY)\,, \qquad (4.19)$$



that is, the correlation of X and Y is the product of the means, in which 
case the two random variables are said to be uncorrelated. (The term linear 
independence is sometimes used as a synonym for uncorrelated.) 

We have shown that if two discrete random variables are independent, 
then they are also uncorrelated. Note that independence implies not only 
that two random variables are uncorrelated but also that all functions of 
the random variables are uncorrelated — a much stronger property. In 
particular, two uncorrelated random variables need not be independent. 
For example, consider two random variables X and Y with the joint pmf

$$p_{X,Y}(x, y) = \begin{cases} 1/4 & \text{if } (x, y) = (1, 1) \text{ or } (-1, 1) \\ 1/2 & \text{if } (x, y) = (0, 0)\,. \end{cases}$$

A simple calculation shows that

$$E(XY) = \tfrac{1}{4}(1 - 1) + \tfrac{1}{2}(0) = 0$$

and

$$(EX)(EY) = (0)(1/2) = 0\,,$$

and hence the random variables are uncorrelated. They are not, however,
independent. For example, Pr(X = 0|Y = 0) = 1 while Pr(X = 0) = 1/2.
As another example, consider the case where $p_X(x) = 1/3$ for x = −1, 0, 1
and $Y = X^2$. Then X and Y are uncorrelated but not independent.
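
The first example can be verified mechanically. The Python sketch below encodes the joint pmf given above, computes the correlation and the product of the means, and then tests whether the joint pmf factors into the product of its marginals; the helper names are illustrative assumptions.

```python
from fractions import Fraction

# Joint pmf from the example: mass 1/4 at (1, 1) and (-1, 1), mass 1/2 at (0, 0).
joint = {(1, 1): Fraction(1, 4), (-1, 1): Fraction(1, 4), (0, 0): Fraction(1, 2)}


def expect(g):
    """E[g(X, Y)] with respect to the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def marginal_x(a):
    """p_X(a), obtained by summing the joint pmf over y."""
    return sum(p for (x, _), p in joint.items() if x == a)


def marginal_y(b):
    """p_Y(b), obtained by summing the joint pmf over x."""
    return sum(p for (_, y), p in joint.items() if y == b)


correlation = expect(lambda x, y: x * y)
product_of_means = expect(lambda x, y: x) * expect(lambda x, y: y)
uncorrelated = correlation == product_of_means

# Independence requires p_{X,Y}(a, b) = p_X(a) * p_Y(b) for every pair (a, b).
independent = all(
    joint.get((a, b), 0) == marginal_x(a) * marginal_y(b)
    for a in (-1, 0, 1)
    for b in (0, 1)
)

print(uncorrelated, independent)
```

The factorization already fails at (0, 0), where the joint pmf is 1/2 but the product of the marginals is 1/4.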

Thus uncorrelation does not imply independence. If, however, all possible
functions of the two random variables are uncorrelated, that is, if
(4.18) holds, then they must be independent. To see this in the discrete
case, just consider all possible functions of the form $1_a(x)$, that is, indicator
functions of all of the points. ($1_a(x)$ is 1 if x = a and zero otherwise.) Let
$g = 1_a$ and $h = 1_b$ for a in the range space of X and b in the range space
of Y. It follows from (4.18) and (4.10) that

$$p_{X,Y}(a, b) = p_X(a)\,p_Y(b)\,.$$

Obviously the result holds for all a and b. Thus the two random variables 
are independent. It can now be seen that (4.18) provides a necessary and 
sufficient condition for independence, a fact we formally state as a theorem. 

Theorem 4.3 Two random variables X and Y are independent if and only
if g(X) and h(Y) are uncorrelated for all functions g and h with finite
expectations, that is, if (4.18) holds. More generally, random variables
$X_i$, i = 1, ..., n, are mutually independent if and only if for all functions
$g_i$, i = 1, ..., n, the random variables $g_i(X_i)$ are uncorrelated.

This theorem is useful as a means of showing that two random variables
are not independent: if we can find any functions g and h such that
$E(g(X)h(Y)) \ne (Eg(X))(Eh(Y))$, then the random variables are not
independent. The theorem also provides a simple and general proof of the
fact that the characteristic function of the sum of independent random
variables is the product of the characteristic functions of the random
variables being summed.



Corollary 4.1 Given a sequence of mutually independent random variables
$X_1, X_2, \ldots, X_n$, define

$$Y_n = \sum_{i=1}^{n} X_i.$$

Then

$$M_{Y_n}(ju) = \prod_{i=1}^{n} M_{X_i}(ju).$$



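Proof. Successive application of theorem 4.3, which states that functions
of independent random variables are uncorrelated, yields

$$E\left(e^{juY_n}\right) = E\left(e^{ju\sum_{i=1}^{n} X_i}\right) = E\left(\prod_{i=1}^{n} e^{juX_i}\right) = \prod_{i=1}^{n} E\left(e^{juX_i}\right) = \prod_{i=1}^{n} M_{X_i}(ju).$$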
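
As a numerical illustration of the corollary, the following Python sketch compares the characteristic function of a sum of two independent discrete random variables with the product of the individual characteristic functions. The two pmfs and the helper names `char_fn` and `pmf_of_sum` are hypothetical choices made only for this example.

```python
import cmath
from itertools import product

def char_fn(pmf, u):
    """M(ju) = E[exp(j u X)] for a pmf given as {value: probability}."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())

def pmf_of_sum(pmf_a, pmf_b):
    """pmf of A + B for independent A and B (a discrete convolution)."""
    out = {}
    for (a, pa), (b, pb) in product(pmf_a.items(), pmf_b.items()):
        out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

pmf_x1 = {0: 0.3, 1: 0.7}             # hypothetical Bernoulli pmf
pmf_x2 = {-1: 0.2, 0: 0.5, 2: 0.3}    # hypothetical three-point pmf
pmf_y = pmf_of_sum(pmf_x1, pmf_x2)    # pmf of Y = X1 + X2

u = 0.8                                # an arbitrary test frequency
lhs = char_fn(pmf_y, u)                # characteristic function of the sum
rhs = char_fn(pmf_x1, u) * char_fn(pmf_x2, u)   # product of the two
print(abs(lhs - rhs))
```

The difference should vanish up to floating point rounding for any choice of u.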






4.5.2 Covariance 

The idea of uncorrelation can be stated conveniently in terms of another
quantity, which we now define. Given two random variables X and Y,
define their covariance, COV(X,Y), by

$$\mathrm{COV}(X,Y) = E[(X - EX)(Y - EY)].$$

As you can see, the covariance of two random variables equals the
correlation of the two “centralized” random variables, X - EX and Y - EY,
that are formed by subtracting the means from the respective random
variables. Keeping in mind that EX and EY are constants, it is seen
that centralized random variables are zero-mean random variables; i.e.,
E(X - EX) = E(X) - E(EX) = EX - EX = 0. Expanding the product
in the definition, the covariance can be written in terms of the correlation
and means of the random variables. Again remembering that EX and EY
are constants, we get

$$\begin{aligned}
\mathrm{COV}(X,Y) &= E[XY - YEX - XEY + (EX)(EY)] \\
&= E(XY) - (EY)(EX) - (EX)(EY) + (EX)(EY) \\
&= E(XY) - (EX)(EY).
\end{aligned} \tag{4.20}$$



Thus the covariance is the correlation minus the product of the means. 
Using this fact and the definition of uncorrelated, we have the following 
statement: 



Corollary 4.2 Two random variables X and Y are uncorrelated if and
only if their covariance is zero; that is, if COV(X,Y) = 0.

If we set Y = X, the correlation of X with itself, $E(X^2)$, results; this
is called the second moment of the random variable X. The covariance
COV(X,X) is called the variance of the random variable and is given the
special notation $\sigma_X^2$. $\sigma_X = \sqrt{\sigma_X^2}$ is called the standard deviation of X.

From the definition of covariance and (4.20),

$$\sigma_X^2 = E[(X - EX)^2] = E(X^2) - (EX)^2.$$

By the first property of expectation, the variance is nonnegative, yielding
the simple but powerful inequality

$$|EX| \le [E(X^2)]^{1/2}, \tag{4.21}$$

a special case of the Cauchy-Schwarz inequality (see problem 4.17 with the
random variable Y set equal to the constant 1).



4.5.3 Covariance Matrices 

The fundamental theorem of expectation of functions of several random 
variables can also be extended to vector or even matrix functions g of 
random vectors as well. There are two primary examples, the covariance 
matrix treated here and the multivariable characteristic functions treated 
next. 

Suppose that we are given an n-dimensional random vector
$X = (X_0, X_1, \ldots, X_{n-1})$. The mean vector $m = (m_0, m_1, \ldots, m_{n-1})^t$
is defined as the vector of the means, i.e., $m_k = E(X_k)$ for all
k = 0, 1, ..., n - 1. This can be written more conveniently as a single
vector expectation

$$m = E(X) = \int f_X(x)\,x\,dx, \tag{4.22}$$

where the random vector X and the dummy integration vector x are both n
dimensional and the integral of a vector is simply the vector of the integrals
of the individual components. Similarly we could define for each
k, l = 0, 1, ..., n - 1 the covariance $K_X(k,l) = E[(X_k - m_k)(X_l - m_l)]$ and
then collect these together to form the covariance matrix

$$K = \{K_X(k,l);\ k = 0, 1, \ldots, n-1;\ l = 0, 1, \ldots, n-1\}.$$

Alternatively, we can use the outer product notation of linear algebra and
the fundamental theorem of expectation to write

$$K = E[(X - m)(X - m)^t] = \int f_X(x)\,(x - m)(x - m)^t\,dx, \tag{4.23}$$

where the outer product of a vector a with a vector b, $ab^t$, has (k,j) entry
equal to $a_k b_j$.
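
The definitions of the mean vector and of the covariance matrix as an expected outer product translate directly into a short computation. The sketch below uses a discrete random vector (so that the integrals become sums) with a hypothetical three-point pmf chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical pmf of a two-dimensional random vector X = (X0, X1).
pmf = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 2): F(1, 2)}
n = 2

# Mean vector: m_k = E(X_k).
m = [sum(p * x[k] for x, p in pmf.items()) for k in range(n)]

# Covariance matrix: the (k, l) entry of E[(X - m)(X - m)^t] is
# E[(X_k - m_k)(X_l - m_l)].
K = [[sum(p * (x[k] - m[k]) * (x[l] - m[l]) for x, p in pmf.items())
      for l in range(n)]
     for k in range(n)]

print(m)
print(K)
```

The resulting matrix is symmetric and its diagonal entries are the variances of the individual components.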

In particular, by straightforward but tedious multiple integration, it can 
be shown that the mean vector and the covariance matrix of a Gaussian 
random vector are indeed the mean and covariance, i.e., using the funda- 
mental theorem 



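\begin{align}
m &= E(X) = \int_{\Re^n} x\,\frac{e^{-\frac{1}{2}(x-m)^t K^{-1}(x-m)}}{(2\pi)^{n/2}\sqrt{\det K}}\,dx \tag{4.24}\\
K &= E[(X - m)(X - m)^t] = \int_{\Re^n} (x-m)(x-m)^t\,\frac{e^{-\frac{1}{2}(x-m)^t K^{-1}(x-m)}}{(2\pi)^{n/2}\sqrt{\det K}}\,dx. \tag{4.25}
\end{align}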



4.5.4 Multivariable Characteristic Functions 

The fundamental theorem of expectation of functions of several random
variables can also be extended to vector functions g of random vectors as
well. In fact we implicitly assumed this to be the case in the evaluation of
the characteristic function of a Gaussian random variable (since $e^{juX}$ is a
complex function and hence a vector function) and of the multidimensional
characteristic function of a Gaussian random vector in (3.126): if a
Gaussian random vector X has a mean vector m and covariance matrix $\Lambda$,
then

$$M_X(ju) = e^{ju^t m - \frac{1}{2}u^t \Lambda u} = \exp\left[\, j\sum_{k=0}^{n-1} u_k m_k - \frac{1}{2}\sum_{k=0}^{n-1}\sum_{m=0}^{n-1} u_k \Lambda(k,m)\,u_m \right].$$



This representation for the characteristic function yields the proof of the 
following important result: 



Theorem 4.4 Let X be a k-dimensional Gaussian random vector with
mean $m_X$ and covariance matrix $\Lambda_X$. Let Y be the new random vector
formed by a linear operation of X:

$$Y = HX + b, \tag{4.26}$$

where H is an n × k matrix and b is an n-dimensional vector. Then Y is a
Gaussian random vector of dimension n with mean

$$m_Y = Hm_X + b \tag{4.27}$$

and covariance matrix

$$\Lambda_Y = H\Lambda_X H^t. \tag{4.28}$$



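Proof. The characteristic function of Y is found by direct substitution
of the expression for Y in terms of X into the definition, a little matrix
algebra, and (3.126):

$$\begin{aligned}
M_Y(ju) &= E\left[e^{ju^t Y}\right] = E\left[e^{ju^t(HX + b)}\right] \\
&= e^{ju^t b}\,E\left[e^{j(H^t u)^t X}\right] \\
&= e^{ju^t b}\,M_X(jH^t u) \\
&= e^{ju^t b}\,e^{ju^t H m_X - \frac{1}{2}(H^t u)^t \Lambda_X (H^t u)} \\
&= e^{ju^t(Hm_X + b) - \frac{1}{2}u^t H\Lambda_X H^t u}.
\end{aligned}$$

It can be seen by reference to (3.126) that the resulting characteristic
function is the transform of a Gaussian random vector pdf with mean vector
$Hm_X + b$ and covariance matrix $H\Lambda_X H^t$. This completes the proof.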

The following observation is trivial, but it emphasizes a useful fact.
Suppose that X is a Gaussian random vector of dimension, say, k, and we
form a new vector Y by subsampling X, that is, by selecting a subset of the
$(X_0, X_1, \ldots, X_{k-1})$, say $Y_i = X_{l(i)}$, i = 0, 1, ..., m < k. Then we can
write Y = AX, where A is a matrix that has $a_{i,l(i)} = 1$ for i = 0, 1, ..., m < k
and 0’s everywhere else. The preceding result implies immediately that Y
is Gaussian and shows how to compute the mean and covariance. Thus any
subvector of a Gaussian vector is also a Gaussian vector. This could also
have been proved by a derived distribution and messy multidimensional
integrals, but the previous result provides a nice shortcut.
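
The bookkeeping in Theorem 4.4 and in the subsampling remark can be sketched in a few lines of Python. The numerical values of the mean, covariance, H, and b below are hypothetical, and `matmul` and `transpose` are small helpers written for this example; the code only computes the mean and covariance of (4.27) and (4.28) and does not test Gaussianity.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

# Hypothetical 3-dimensional mean (as a column) and covariance matrix.
m_x = [[1.0], [2.0], [3.0]]
lam_x = [[2.0, 0.5, 0.0],
         [0.5, 1.0, 0.3],
         [0.0, 0.3, 1.5]]

# Linear operation Y = HX + b with Y0 = X0 + X1 + 0.5 and Y1 = X2 - 1.
H = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [[0.5], [-1.0]]

# (4.27): m_Y = H m_X + b, and (4.28): Lambda_Y = H Lambda_X H^t.
m_y = [[hm[0] + bi[0]] for hm, bi in zip(matmul(H, m_x), b)]
lam_y = matmul(matmul(H, lam_x), transpose(H))

# Subsampling as a special case: select X0 and X2 with a 0/1 matrix A.
A = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
lam_sub = matmul(matmul(A, lam_x), transpose(A))

print(m_y, lam_y, lam_sub)
```

For the subsampling matrix the product simply picks out the rows and columns of the original covariance that correspond to the selected components.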

4.5.5 Example: Differential Entropy of a Gaussian Vector

Suppose that $X = (X_0, X_1, \ldots, X_{n-1})$ is a Gaussian random vector
described by a pdf $f_X$ specified by a mean vector m and a covariance matrix
$K_X$. The differential entropy of a continuous vector X is defined by

$$\begin{aligned}
h(X) &= -\int f_X(x)\log f_X(x)\,dx \\
&= -\int f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})\,\log f_{X_0,X_1,\ldots,X_{n-1}}(x_0,x_1,\ldots,x_{n-1})\,dx_0\,dx_1\cdots dx_{n-1},
\end{aligned} \tag{4.29}$$

where the units are called “bits” if the logarithm is base 2 and “nats” if
the logarithm is base e. The differential entropy plays a fundamental role
in Shannon information theory for continuous alphabet random processes.
See, for example, Cover and Thomas [?]. It will also prove a very useful
aspect of a random vector when considering linear prediction or estimation.
We here use the fundamental theorem of expectation for functions of several
variables to evaluate the differential entropy h(X) of a Gaussian random
vector.

Plugging in the density for the Gaussian pdf and using natural logarithms
results in

$$\begin{aligned}
h(X) &= -\int f_X(x)\ln f_X(x)\,dx \\
&= \ln\sqrt{(2\pi)^n\det K} + \frac{1}{2}\int (x-m)^t K^{-1}(x-m)\,(2\pi)^{-n/2}(\det K)^{-1/2}\,e^{-\frac{1}{2}(x-m)^t K^{-1}(x-m)}\,dx.
\end{aligned}$$

The final term can be easily evaluated by a trick. From linear algebra we
can write for any n-dimensional vector a and n × n matrix A

$$a^t A a = \mathrm{Tr}(A a a^t), \tag{4.30}$$

where Tr is the trace or sum of diagonals of the matrix. Thus using the
linearity of expectation we can rewrite the previous equation as

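$$\begin{aligned}
h(X) &= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{1}{2}E\left[(X-m)^t K^{-1}(X-m)\right] \\
&= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{1}{2}E\left(\mathrm{Tr}\left[K^{-1}(X-m)(X-m)^t\right]\right) \\
&= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{1}{2}\mathrm{Tr}\left[K^{-1}E\left((X-m)(X-m)^t\right)\right] \\
&= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{1}{2}\mathrm{Tr}\left[K^{-1}K\right] \\
&= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{1}{2}\mathrm{Tr}[I] \\
&= \frac{1}{2}\ln\left((2\pi)^n\det K\right) + \frac{n}{2} \\
&= \frac{1}{2}\ln\left((2\pi e)^n\det K\right) \text{ nats.}
\end{aligned} \tag{4.31}$$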
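
The closed form (4.31) is easy to evaluate numerically. The following sketch handles only the 2 × 2 case so that the determinant can be written out by hand; the covariance values are hypothetical and the function name `gaussian_entropy_nats` is our own.

```python
import math

def gaussian_entropy_nats(K):
    """Differential entropy (4.31) in nats for a 2 x 2 covariance matrix K."""
    n = 2
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return 0.5 * math.log((2 * math.pi * math.e) ** n * det)

K = [[2.0, 0.6],
     [0.6, 1.0]]           # hypothetical covariance matrix

h_nats = gaussian_entropy_nats(K)
h_bits = h_nats / math.log(2)    # change of logarithm base from e to 2

# The next-to-last line of (4.31): (1/2) ln((2 pi)^n det K) + n/2.
det_K = K[0][0] * K[1][1] - K[0][1] * K[1][0]
h_alt = 0.5 * math.log((2 * math.pi) ** 2 * det_K) + 1.0

print(h_nats, h_bits, h_alt)
```

Note that the entropy depends on the covariance only through its determinant and does not depend on the mean at all.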

4.6 Conditional Expectation 

Expectation is essentially a weighted integral or summation with respect
to a probability distribution. If one uses a conditional distribution, then
the expectation is also conditional. For example, suppose that (X,Y) is a
random vector described by a joint pmf $p_{X,Y}$. The ordinary expectation of
Y is defined as usual by $EY = \sum_y y\,p_Y(y)$. Suppose, however, that one is
told that X = x and hence one has the conditional (a posteriori) pmf $p_{Y|X}$.
Then one can define the conditional expectation of Y given X = x by

$$E(Y|x) = \sum_{y \in A_Y} y\,p_{Y|X}(y|x), \tag{4.32}$$

that is, the usual expectation, but with respect to the pmf $p_{Y|X}(\cdot|x)$. So
far, this is an almost trivial generalization. Perhaps unfortunately, however,
(4.32) is not in fact what is usually defined as conditional expectation.
The actual definition might appear to be only slightly different, but there
is a fundamental difference and a potential for confusion because of the
notation. As we have defined it so far, the conditional expectation of Y
given X = x is a function of the independent variable x, say g(x). In other
words,

$$g(x) = E(Y|x).$$



If we take any function g(x) of x and replace the independent variable x
by a random variable X, we get a new random variable g(X). If we simply
replace the independent variable x in E(Y|x) by the random variable X,
the resulting quantity is a random variable and is denoted by E(Y|X). It
is this random variable that is defined as the conditional expectation of Y
given X. The previous definition E(Y|x) can be considered as a sample
value of the random variable E(Y|X). Note that we can write the definition
as

$$E(Y|X) = \sum_{y \in A_Y} y\,p_{Y|X}(y|X), \tag{4.33}$$

but the reader must beware the dual use of X: in the subscript it denotes
as usual the name of the random variable, in the argument it denotes the
random variable itself, i.e., E(Y|X) is a function of the random variable X
and hence is itself a random variable.

Since E{Y\X) is a random variable, we can evaluate its expectation 
using the fundamental theorem of expectation. The resulting formula has 
wide application in probability theory. Taking this expectation we have 
that 



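$$\begin{aligned}
E[E(Y|X)] &= \sum_{x \in A_X} p_X(x)\,E(Y|x) \\
&= \sum_{x \in A_X} p_X(x) \sum_{y \in A_Y} y\,p_{Y|X}(y|x) \\
&= \sum_{y \in A_Y} y \sum_{x \in A_X} p_{X,Y}(x,y) \\
&= \sum_{y \in A_Y} y\,p_Y(y) \\
&= EY,
\end{aligned}$$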



a result known as iterated expectation or nested expectation. Roughly
speaking it states that if we wish to find the expectation of a random
variable Y, then we can first find its conditional expectation with respect
to another random variable, E(Y|X), and then take the expectation of the
resulting random variable to obtain

$$EY = E[E(Y|X)]. \tag{4.34}$$
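
Iterated expectation can be verified on any small joint pmf. The sketch below uses a hypothetical four-point pmf and exact rational arithmetic; `p_x` and `cond_exp_y` are illustrative helper names implementing the marginal pmf and the conditional expectation (4.32).

```python
from fractions import Fraction as F

# Hypothetical joint pmf p_{X,Y}(x, y), stored as {(x, y): probability}.
pmf = {(0, 1): F(1, 8), (0, 3): F(3, 8), (1, 1): F(1, 4), (1, 5): F(1, 4)}

def p_x(x):
    """Marginal pmf p_X(x)."""
    return sum(p for (xx, _), p in pmf.items() if xx == x)

def cond_exp_y(x):
    """E(Y|x) = sum over y of y p_{Y|X}(y|x), as in (4.32)."""
    return sum(y * p for (xx, y), p in pmf.items() if xx == x) / p_x(x)

xs = sorted({x for x, _ in pmf})

# Direct evaluation of EY from the joint pmf.
e_y_direct = sum(y * p for (_, y), p in pmf.items())

# Iterated expectation (4.34): average E(Y|x) over the distribution of X.
e_y_iterated = sum(p_x(x) * cond_exp_y(x) for x in xs)

print(e_y_direct, e_y_iterated)
```

The two numbers agree exactly because the second computation is only a regrouping of the terms in the first.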

In the next section we shall see an interpretation of conditional expec- 
tation as an estimator of one random variable given another. A simple 
example now, however, helps point out how this result can be useful. 

Suppose that one has a random process $\{X_k;\ k = 0, 1, \ldots\}$ with
identically distributed random variables $X_k$, and a random variable N that
takes on positive integer values. Suppose also that the random variables
$X_k$ are all independent of N. Suppose that one defines a new random variable

$$Y = \sum_{k=0}^{N-1} X_k,$$

that is, the sum of a random number of random variables. How does one
evaluate the expectation EY? Finding the derived distribution is daunting,
but iterated expectation comes to the rescue. Iterated expectation states
that EY = E[E(Y|N)], where E(Y|N) is found by evaluating E(Y|n) and
replacing n by N. But given N = n, the random variable Y is simply
$Y = \sum_{k=0}^{n-1} X_k$, since the distribution of the $X_k$ is not affected by the
fact that N = n because the $X_k$ are independent of N. Hence by the
linearity of expectation,

$$E(Y|n) = \sum_{k=0}^{n-1} EX_k,$$

where the identically distributed assumption implies that the $EX_k$ are all
equal, say EX. Thus E(Y|n) = nEX and hence E(Y|N) = NEX. Then
iterated expectation implies that

$$EY = E(N\,EX) = (EN)(EX), \tag{4.35}$$

the product of the two means. Try finding this result without using iterated
expectation. As a particular example, if the random variables are Bernoulli
random variables with parameter p and N has a Poisson distribution with
parameter $\lambda$, then $\Pr(X_i = 1) = p$ for all i and $EN = \lambda$, and hence
$EY = p\lambda$.
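
For a small case the result (4.35) can be confirmed by brute force. The sketch below assumes, purely for illustration, Bernoulli summands with p = 1/3 and an N that takes only the values 1, 2, 3 (rather than a Poisson N), so that every outcome can be enumerated exactly.

```python
from fractions import Fraction as F
from itertools import product

p = F(1, 3)
pmf_x = {0: 1 - p, 1: p}                        # Bernoulli(p) summands
pmf_n = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}    # hypothetical pmf for N

# Brute force: enumerate N = n and every outcome (x_0, ..., x_{n-1}),
# using the independence of N and the X_k to form the joint probability.
e_y = F(0)
for n, pn in pmf_n.items():
    for outcome in product(pmf_x, repeat=n):
        prob = pn
        for x in outcome:
            prob *= pmf_x[x]
        e_y += prob * sum(outcome)

# Iterated expectation shortcut (4.35): EY = (EN)(EX).
e_n = sum(n * pn for n, pn in pmf_n.items())
e_x = sum(x * px for x, px in pmf_x.items())

print(e_y, e_n * e_x)
```

The enumeration grows exponentially in the largest value of N, which is a small-scale reminder of why the shortcut is welcome.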

Iterated expectation has a more general form. Just as constants can 
be pulled out of ordinary expectations, quantities depending only on the 
variable conditioned on can be pulled out of conditional expectations. We 
state and prove this formally. 

Lemma 4.2 General Iterated Expectation

Suppose that X, Y are discrete random variables and that g(X) and
h(X,Y) are functions of these random variables. Then

$$E[g(X)h(X,Y)] = E\left(g(X)\,E[h(X,Y)|X]\right). \tag{4.36}$$

Proof:

$$\begin{aligned}
E[g(X)h(X,Y)] &= \sum_{x,y} g(x)h(x,y)\,p_{X,Y}(x,y) \\
&= \sum_{x} p_X(x)\,g(x) \sum_{y} h(x,y)\,p_{Y|X}(y|x) \\
&= \sum_{x} p_X(x)\,g(x)\,E[h(X,Y)|x] \\
&= E\left(g(X)\,E[h(X,Y)|X]\right).
\end{aligned}$$



As with ordinary iterated expectation, this is primarily an interpretation
of an algebraic rewriting of the definition of expectation. Note that if we
take g(x) = 1 and h(x,y) = y, this general form reduces to the previous
form.

In a similar vein, one can extend the idea of conditional expectation to
continuous random variables by using pdf’s instead of pmf’s. For example,

$$E(Y|x) = \int y\,f_{Y|X}(y|x)\,dy,$$

and E(Y|X) is defined by replacing x by X in the above formula. Both
iterated expectation and its general form extend to this case by replacing
sums by integrals.



4.7 ⋆ Jointly Gaussian Vectors

Gaussian vectors provide an interesting example of a situation where
conditional expectations can be explicitly computed, and this in turn provides
additional fundamental, if unsurprising, properties of Gaussian vectors.
Instead of considering a Gaussian random vector $X = (X_0, X_1, \ldots, X_{N-1})^t$,
say, consider instead a random vector

$$U = \begin{pmatrix} X \\ Y \end{pmatrix}$$

formed by concatenating two vectors X and Y of dimensions, say, k and m,
respectively. For this section we will drop the boldface notation for vectors.
If U is Gaussian, then we say that X and Y are jointly Gaussian. From
Theorem 4.4 it follows that if X and Y are jointly Gaussian, then they
are individually Gaussian with, say, means $m_X$ and $m_Y$, respectively, and
covariance matrices $K_X$ and $K_Y$, respectively. The goal of this section is
to develop the conditional second order moments for Y given X and to
show in the process that given X, Y has a Gaussian density. Thus not only
is any subcollection of a Gaussian random vector Gaussian, it is also true
that the conditional density of any subvector of a Gaussian vector given
a disjoint subvector of the Gaussian vector is Gaussian. This generalizes
(3.61) from two jointly Gaussian scalar random variables to two jointly
Gaussian random vectors. The idea behind the proof is the same, but the
algebra is messier in higher dimensions.

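Begin by writing

$$\begin{aligned}
K_U &= E[(U - m_U)(U - m_U)^t] \\
&= E\left[\begin{pmatrix} X - m_X \\ Y - m_Y \end{pmatrix}\left((X - m_X)^t \ \ (Y - m_Y)^t\right)\right] \\
&= \begin{bmatrix} E[(X - m_X)(X - m_X)^t] & E[(X - m_X)(Y - m_Y)^t] \\ E[(Y - m_Y)(X - m_X)^t] & E[(Y - m_Y)(Y - m_Y)^t] \end{bmatrix} \\
&= \begin{bmatrix} K_X & K_{XY} \\ K_{YX} & K_Y \end{bmatrix},
\end{aligned} \tag{4.37}$$

where $m_U$ denotes the mean of U, $K_X$ and $K_Y$ are ordinary covariance
matrices, and $K_{XY} = K_{YX}^t$ are called cross-covariance matrices. We shall
also denote $K_U$ by $K_{(X,Y)}$, where the subscript is meant to emphasize that
it is the covariance of the cascade vector of both X and Y in distinction to
$K_{XY}$, the cross covariance of X and Y.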

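The key to recognizing the conditional moments and densities is
the following admittedly unpleasant matrix equation, which can be proved
with a fair amount of brute force linear algebra:

$$\begin{bmatrix} K_X & K_{XY} \\ K_{YX} & K_Y \end{bmatrix}^{-1} = \begin{bmatrix} K_X^{-1} + K_X^{-1}K_{XY}K_{Y|X}^{-1}K_{YX}K_X^{-1} & -K_X^{-1}K_{XY}K_{Y|X}^{-1} \\ -K_{Y|X}^{-1}K_{YX}K_X^{-1} & K_{Y|X}^{-1} \end{bmatrix}, \tag{4.38}$$

where

$$K_{Y|X} = K_Y - K_{YX}K_X^{-1}K_{XY}. \tag{4.39}$$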



The determined reader who wishes to verify the above should do the block
matrix multiplication

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} K_X^{-1} + K_X^{-1}K_{XY}K_{Y|X}^{-1}K_{YX}K_X^{-1} & -K_X^{-1}K_{XY}K_{Y|X}^{-1} \\ -K_{Y|X}^{-1}K_{YX}K_X^{-1} & K_{Y|X}^{-1} \end{bmatrix} \begin{bmatrix} K_X & K_{XY} \\ K_{YX} & K_Y \end{bmatrix}$$

and show that a is a k × k identity matrix, d is an m × m identity matrix,
and that b and c contain all zeros so that the product is indeed
an identity matrix.

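The conditional pdf for Y given X follows directly from the definitions
as

$$\begin{aligned}
f_{Y|X}(y|x) &= \frac{f_{XY}(x,y)}{f_X(x)} \\
&= \frac{(2\pi)^{-(k+m)/2}(\det K_U)^{-1/2}\exp\left(-\frac{1}{2}\left((x - m_X)^t,\ (y - m_Y)^t\right)K_U^{-1}\begin{pmatrix} x - m_X \\ y - m_Y \end{pmatrix}\right)}{(2\pi)^{-k/2}(\det K_X)^{-1/2}\exp\left(-\frac{1}{2}(x - m_X)^t K_X^{-1}(x - m_X)\right)} \\
&= (2\pi)^{-m/2}\left(\frac{\det K_U}{\det K_X}\right)^{-1/2} \exp\left[-\frac{1}{2}\left(\left((x - m_X)^t,\ (y - m_Y)^t\right)K_U^{-1}\begin{pmatrix} x - m_X \\ y - m_Y \end{pmatrix} - (x - m_X)^t K_X^{-1}(x - m_X)\right)\right].
\end{aligned}$$

Again using some brute force linear algebra, it can be shown that the
quadratic terms in the exponential can be expressed in the form

$$\begin{aligned}
&\left((x - m_X)^t,\ (y - m_Y)^t\right)K_U^{-1}\begin{pmatrix} x - m_X \\ y - m_Y \end{pmatrix} - (x - m_X)^t K_X^{-1}(x - m_X) \\
&\qquad = \left(y - m_Y - K_{YX}K_X^{-1}(x - m_X)\right)^t K_{Y|X}^{-1}\left(y - m_Y - K_{YX}K_X^{-1}(x - m_X)\right).
\end{aligned}$$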



Defining

$$m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X), \tag{4.40}$$

the conditional density simplifies to

$$f_{Y|X}(y|x) = (2\pi)^{-m/2}\left(\frac{\det K_U}{\det K_X}\right)^{-1/2} \exp\left(-\frac{1}{2}(y - m_{Y|x})^t K_{Y|X}^{-1}(y - m_{Y|x})\right), \tag{4.41}$$

which shows that conditioned on X = x, Y has a Gaussian density. This
means that we can immediately recognize the conditional expectation of Y
given X as

$$E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X), \tag{4.42}$$

so that the conditional expectation is an affine function of the vector x. We
can also infer from the form that $K_{Y|X}$ is the (conditional) covariance

$$K_{Y|X} = E[(Y - E(Y|X = x))(Y - E(Y|X = x))^t\,|\,x], \tag{4.43}$$

which unlike the conditional mean does not depend on the vector x!
Furthermore, since we know how the normalization must relate to the
covariance matrix, we have that

$$\det(K_{Y|X}) = \frac{\det(K_U)}{\det(K_X)}. \tag{4.44}$$



These relations completely describe the conditional densities of one sub- 
vector of a Gaussian vector given another subvector. We shall see, however, 
that the importance of these results goes beyond the above evaluation and 
provides some fundamental results regarding optimal nonlinear estimation 
for Gaussian vectors and optimal linear estimation in general. 
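
In the simplest case, where X and Y are both scalars (k = m = 1), the relations (4.39), (4.40), and (4.44) reduce to ordinary arithmetic. The following sketch evaluates them for hypothetical means and covariances; the function name `conditional_gaussian` is our own.

```python
def conditional_gaussian(m_x, m_y, k_x, k_y, k_xy, x):
    """Conditional mean and variance of Y given X = x for jointly
    Gaussian scalars, following (4.40) and (4.39) with K_YX = K_XY."""
    m_cond = m_y + (k_xy / k_x) * (x - m_x)   # (4.40)
    k_cond = k_y - k_xy * k_xy / k_x          # (4.39)
    return m_cond, k_cond

# Hypothetical means, variances, and cross covariance.
m_x, m_y = 1.0, -2.0
k_x, k_y, k_xy = 4.0, 3.0, 2.0

m_cond, k_cond = conditional_gaussian(m_x, m_y, k_x, k_y, k_xy, x=3.0)

# Determinant of the 2 x 2 joint covariance K_U of (4.37).
det_ku = k_x * k_y - k_xy * k_xy

# (4.44): det(K_{Y|X}) = det(K_U) / det(K_X), here a ratio of scalars.
print(m_cond, k_cond, det_ku / k_x)
```

Changing the observed value x moves the conditional mean along a line but leaves the conditional variance untouched, as noted after (4.43).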



4.8 Expectation as Estimation 

Suppose that one is asked to guess the value that a random variable Y
will take on, knowing the distribution of the random variable. What is the
best guess or estimate, say $\hat{Y}$? Obviously there are many ways to define
a best estimate, but one of the most popular ways to define a cost or
distortion resulting from estimating the “true” value of Y by $\hat{Y}$ is to look
at the expected value of the square of the error $Y - \hat{Y}$, $E[(Y - \hat{Y})^2]$, the
so-called mean squared error or MSE. Many arguments have been advanced in
support of this approach, perhaps the simplest being that if one views the
error as a voltage, then the average squared error is the average energy in the
error. The smaller the energy, the weaker the signal in some sense. Perhaps
a more honest reason for the popularity of the measure is its tractability in
a wide variety of problems; it often leads to nice solutions that indeed work
well in practice. As an example, we show that the optimal estimate of the
value of an unknown random variable is in fact the mean of the random
variable, a result that is highly intuitive. Rather than use calculus to prove
this result (a tedious approach requiring setting derivatives to zero and
then looking at second derivatives to verify that indeed the stationary point
is a minimum), we directly prove the global optimality of the result.
Suppose that our estimate is $\hat{Y} = a$, some constant. We will show that
this estimate can never have mean squared error smaller than that resulting
from using the expected value of Y as an estimate. This is accomplished
by a simple sequence of equalities and inequalities. Begin by adding and
subtracting the mean, expanding the square, and using the second and
third properties of expectation as

$$\begin{aligned}
E[(Y - a)^2] &= E[(Y - EY + EY - a)^2] \\
&= E[(Y - EY)^2] + 2E[(Y - EY)(EY - a)] + (EY - a)^2.
\end{aligned}$$

The cross product is evaluated using the linearity of expectation and the
fact that EY is a constant as

$$E[(Y - EY)(EY - a)] = (EY)^2 - aEY - (EY)^2 + aEY = 0$$

and hence from Property 1 of expectation,

$$E[(Y - a)^2] = E[(Y - EY)^2] + (EY - a)^2 \ge E[(Y - EY)^2], \tag{4.45}$$

which is the mean squared error resulting from using the mean of Y as
an estimate. Thus the mean of a random variable is the minimum mean
squared error estimate (MMSE) of the value of a random variable in the
absence of any a priori information.
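
The decomposition in (4.45) can be checked exactly for a discrete random variable. The pmf below is a hypothetical example, and `mse` is an illustrative helper that evaluates the mean squared error of a constant guess a.

```python
from fractions import Fraction as F

# Hypothetical pmf of Y, stored as {value: probability}.
pmf_y = {0: F(1, 2), 1: F(1, 4), 4: F(1, 4)}

def mse(a):
    """Mean squared error E[(Y - a)^2] of the constant estimate a."""
    return sum(p * (y - a) ** 2 for y, p in pmf_y.items())

mean_y = sum(p * y for y, p in pmf_y.items())
variance = mse(mean_y)          # the MSE achieved by the mean

# A grid of competing constant guesses between -2 and 6.
candidates = [F(k, 4) for k in range(-8, 25)]

# (4.45): the MSE of any guess is the variance plus (EY - a)^2.
decomposition_holds = all(
    mse(a) == variance + (mean_y - a) ** 2 for a in candidates
)
print(mean_y, variance, decomposition_holds)
```

Because the excess error is exactly the squared distance of the guess from the mean, no search over candidates is needed in practice; the grid is there only to illustrate the inequality.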

What if one is given a priori information? For example, suppose that
now you are told that X = x. What then is the best estimate of Y, say
$\hat{Y}(x)$? This problem is easily solved by modifying the previous derivation
to use conditional expectation, that is, by using the conditional distribution
for Y given X instead of the a priori distribution for Y. Once again we try
to minimize the mean squared error:

$$E[(Y - \hat{Y}(X))^2] = E\left(E[(Y - \hat{Y}(X))^2\,|\,X]\right) = \sum_{x} p_X(x)\,E[(Y - \hat{Y}(x))^2\,|\,x].$$

Each of the terms in the sum, however, is just a mean squared error
between a random variable and an estimate of that variable with respect to
a distribution, here the conditional distribution $p_{Y|X}(\cdot|x)$. By the same
argument as was used in the unconditional case, the best estimate is the
mean, but now the mean with respect to the conditional distribution, i.e.,
E(Y|x). In other words, for each x the best $\hat{Y}(x)$ in the sense of minimizing
the mean squared error is E(Y|x). Plugging in the random variable X in
place of the dummy variable x we have the following interpretation:

The conditional expectation E(Y|X) of a random variable
Y given a random variable X is the minimum mean squared
error estimate of Y given X.

A direct proof of this result without invoking the conditional version
of the result for unconditional expectation follows from general iterated
expectation. Suppose that g(X) is an estimate of Y given X. Then the
resulting mean squared error is

$$\begin{aligned}
E[(Y - g(X))^2] &= E[(Y - E(Y|X) + E(Y|X) - g(X))^2] \\
&= E[(Y - E(Y|X))^2] \\
&\quad + 2E[(Y - E(Y|X))(E(Y|X) - g(X))] \\
&\quad + E[(E(Y|X) - g(X))^2].
\end{aligned}$$

Expanding the cross term yields

$$\begin{aligned}
E[(Y - E(Y|X))(E(Y|X) - g(X))] &= E[Y E(Y|X)] - E[Y g(X)] \\
&\quad - E[E(Y|X)^2] + E[E(Y|X) g(X)].
\end{aligned}$$

From the general iterated expectation (4.36), $E[Y E(Y|X)] = E[E(Y|X)^2]$
(setting g(X) of the lemma to E(Y|X) and h(X,Y) = Y) and E[Y g(X)] =
E[E(Y|X)g(X)] (setting g(X) of the lemma to the g(X) used here and
h(X,Y) = Y). Thus the cross term is zero, and the mean squared error of
g(X) equals that of E(Y|X) plus the nonnegative term $E[(E(Y|X) - g(X))^2]$,
so no estimate g(X) can do better than E(Y|X).
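
The optimality of the conditional expectation can be illustrated by comparing it against a family of competing estimators on a small joint pmf. The pmf and the grid of competitors below are hypothetical choices for this sketch; an estimator g is represented as a table giving g(x) for each value of x.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint pmf p_{X,Y}(x, y), stored as {(x, y): probability}.
pmf = {(0, 1): F(1, 8), (0, 3): F(3, 8), (1, 1): F(1, 4), (1, 5): F(1, 4)}

def p_x(x):
    """Marginal pmf p_X(x)."""
    return sum(p for (xx, _), p in pmf.items() if xx == x)

def cond_exp_y(x):
    """Conditional expectation E(Y|x)."""
    return sum(y * p for (xx, y), p in pmf.items() if xx == x) / p_x(x)

def mse(g):
    """E[(Y - g(X))^2] for an estimator given as a table {x: g(x)}."""
    return sum(p * (y - g[x]) ** 2 for (x, y), p in pmf.items())

# The claimed optimum: g(x) = E(Y|x) for each x.
best = {x: cond_exp_y(x) for x in (0, 1)}

# Competing estimators: every table with values on a grid from 0 to 6.
grid = [F(k, 2) for k in range(0, 13)]
others = [{0: a, 1: b} for a, b in product(grid, grid)]

never_beaten = all(mse(g) >= mse(best) for g in others)
print(mse(best), never_beaten)
```

The excess error of any competitor is the average squared distance between its table and the table of conditional means, mirroring the decomposition in the proof.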

As with ordinary expectation, the ideas of conditional expectation can
be extended to continuous random variables by substituting conditional
pdf’s for the unconditional pdf’s. As is the case with conditional
probability, however, this constructive definition has its limitations and only
makes sense when the pdf’s are well defined. The rigorous development of
conditional expectation is, like conditional probability, analogous to the
rigorous treatment of the Dirac delta: it is defined by its behavior underneath
the integral sign rather than by a construction. When the constructive
definition makes sense, the two approaches agree.

One of the unfortunately rare examples for which conditional expec- 
tations can be explicitly evaluated is the case of jointly Gaussian random 
variables. In this case we can immediately identify from (3.61) that 

$$E[Y|X] = m_Y + \rho\,(\sigma_Y/\sigma_X)(X - m_X). \tag{4.46}$$

It will prove important that this is in fact an affine function of X. 

The same ideas extend from scalars to vectors. Suppose we observe a
real-valued column vector $X = (X_0, \ldots, X_{k-1})^t$ and we wish to predict
or estimate a second random vector $Y = (Y_0, \ldots, Y_{m-1})^t$. Note that the
dimensions of the two vectors need not be the same.

The prediction $\hat{Y} = \hat{Y}(X)$ is to be chosen as a function of X which
yields the smallest possible mean squared error, as in the scalar case. The
mean squared error is defined as

$$\epsilon^2(\hat{Y}) = E(\|Y - \hat{Y}\|^2) = E[(Y - \hat{Y})^t(Y - \hat{Y})] = \sum_{i=0}^{m-1} E[(Y_i - \hat{Y}_i)^2]. \tag{4.47}$$

An estimator or predictor is said to be optimal within some class of predic- 
tors if it minimizes the mean squared error over all predictors in the given 
class. 

Two specific examples of vector estimation are of particular interest.
In the first case, the vector X consists of k consecutive samples from a
stationary random process, say $X = (X_{n-1}, X_{n-2}, \ldots, X_{n-k})$ and Y is the
next, or “future”, sample $Y = X_n$. In this case the goal is to find the
best one-step predictor given the finite past. In the second example, Y is
a rectangular subblock of pixels in a sampled image intensity raster and X
consists of similar subgroups above and to the left of Y. Here the goal is
to use portions of an image already coded or processed to predict a new
portion of the same image. This vector prediction problem is depicted in
Figure 4.1 where subblocks A, B, and C would be used to predict subblock
D.



[Figure: four image subblocks labeled A, B, C, and D]

Figure 4.1: Vector Prediction of Image Subblocks

The following theorem shows that the best nonlinear predictor of Y 
given X is simply the conditional expectation of Y given X. Intuitively, our 
best guess of an unknown vector is its expectation or mean given whatever 
observations that we have. This extends the interpretation of a conditional 
expectation as an optimal estimator to the vector case. 

Theorem 4.5 Given two random vectors Y and X, the minimum mean
squared error estimate of Y given X is

$$\hat{Y}(X) = E(Y|X). \tag{4.48}$$

Proof: As in the scalar case, the proof does not require calculus or
Lagrange minimizations. Suppose that $\hat{Y}$ is the claimed optimal estimate
and that $\tilde{Y}$ is some other estimate. We will show that $\tilde{Y}$ must yield a mean
squared error no smaller than does $\hat{Y}$. To see this consider

$$\begin{aligned}
\epsilon^2(\tilde{Y}) &= E(\|Y - \tilde{Y}\|^2) = E(\|Y - \hat{Y} + \hat{Y} - \tilde{Y}\|^2) \\
&= E(\|Y - \hat{Y}\|^2) + E(\|\hat{Y} - \tilde{Y}\|^2) + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})] \\
&\ge \epsilon^2(\hat{Y}) + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})].
\end{aligned}$$

We will prove that the rightmost term is zero and hence that
$\epsilon^2(\tilde{Y}) \ge \epsilon^2(\hat{Y})$, which will prove the theorem. Recall that
$\hat{Y} = E(Y|X)$ and hence

$$E[(Y - \hat{Y})\,|\,X] = 0.$$

Since $\hat{Y} - \tilde{Y}$ is a deterministic function of X,

$$E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})\,|\,X] = 0.$$

Then, by iterated expectation applied to vectors, we have

$$E\left(E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})\,|\,X]\right) = E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})] = 0$$

as claimed, which proves the theorem.

As in the scalar case, the conditional expectation is in general a difficult
function to evaluate with the notable exception of jointly Gaussian vectors.
Recall from (4.41)-(4.44) that the conditional pdf for jointly Gaussian
vectors Y and X with
$K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))]$,
$K_Y = E[(Y - m_Y)(Y - m_Y)^t]$, $K_X = E[(X - m_X)(X - m_X)^t]$,
$K_{XY} = E[(X - m_X)(Y - m_Y)^t]$, $K_{YX} = E[(Y - m_Y)(X - m_X)^t]$ is

$$f_{Y|X}(y|x) = (2\pi)^{-m/2}(\det(K_{Y|X}))^{-1/2} \exp\left(-\frac{1}{2}(y - m_{Y|x})^t K_{Y|X}^{-1}(y - m_{Y|x})\right), \tag{4.49}$$

where

$$K_{Y|X} = K_Y - K_{YX}K_X^{-1}K_{XY} = E[(Y - E(Y|X))(Y - E(Y|X))^t\,|\,X], \tag{4.50}$$

$$\det(K_{Y|X}) = \frac{\det(K_{(X,Y)})}{\det(K_X)}, \tag{4.51}$$

and

$$E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X), \tag{4.52}$$

and hence the minimum mean square estimate of Y given X is

$$E(Y|X) = m_Y + K_{YX}K_X^{-1}(X - m_X), \tag{4.53}$$

which is an affine (linear plus constant) function of X! The resulting mean
squared error is (using iterated expectation)

\begin{align}
E[(Y - E(Y|X))^t(Y - E(Y|X))] &= E\left(E[(Y - E(Y|X))^t(Y - E(Y|X))\,|\,X]\right) \tag{4.54}\\
&= E\left(E[\mathrm{Tr}[(Y - E(Y|X))(Y - E(Y|X))^t]\,|\,X]\right) \notag\\
&= \mathrm{Tr}(K_{Y|X}). \tag{4.55}
\end{align}

In the special case where $X = X^n = (X_0, X_1, \ldots, X_{n-1})^t$ and $Y = X_n$,
the so-called one-step linear prediction problem, the solution takes an
interesting form. For this case define the nth order covariance matrix as
the n × n matrix

$$K_X^{(n)} = E[(X^n - E(X^n))(X^n - E(X^n))^t], \tag{4.56}$$

i.e., the (k,j) entry of $K_X^{(n)}$ is $E[(X_k - E(X_k))(X_j - E(X_j))]$,
k, j = 0, 1, ..., n - 1. Then if $X^{n+1}$ is Gaussian, the optimal one-step
predictor for $X_n$ given $X^n$ is

$$\hat{X}_n(X^n) = E(X_n) + E[(X_n - E(X_n))(X^n - E(X^n))^t]\,(K_X^{(n)})^{-1}(X^n - E(X^n)), \tag{4.57}$$

which has an affine form

$$\hat{X}_n(X^n) = AX^n + b, \tag{4.58}$$

where

$$A = r^t (K_X^{(n)})^{-1}, \tag{4.59}$$

$$r = \begin{pmatrix} K_X(n,0) \\ K_X(n,1) \\ \vdots \\ K_X(n,n-1) \end{pmatrix}, \tag{4.60}$$

and

$$b = E(X_n) - AE(X^n). \tag{4.61}$$

The resulting mean squared error is

$$\mathrm{MMSE} = E[(X_n - \hat{X}_n(X^n))^2] = \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) = \sigma_{X_n}^2 - r^t (K_X^{(n)})^{-1} r, \tag{4.62}$$

which from (4.51) can be expressed as

$$\mathrm{MMSE} = \frac{\det(K_X^{(n+1)})}{\det(K_X^{(n)})}, \tag{4.63}$$
a classical result from minimum mean squared error estimation theory.

If the $X_n$ are samples of a weakly stationary random process with zero
mean, then the optimal predictor simplifies to

$$\hat{X}_n(X^n) = r^t (K_X^{(n)})^{-1} X^n, \tag{4.64}$$

where r is the n-dimensional vector

$$r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}. \tag{4.65}$$
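
To make the stationary predictor concrete, the following sketch computes the coefficient vector $r^t (K_X^{(n)})^{-1}$ for a hypothetical covariance function $K_X(k) = \sigma^2\rho^{|k|}$, which is our own choice for illustration and is not taken from the text. The small Gaussian elimination routine `solve` is included only to keep the example self-contained.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a list of rows, b a list)."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

rho, sigma2, n = 0.8, 1.0, 4

def cov(k):
    """Hypothetical stationary covariance function K_X(k)."""
    return sigma2 * rho ** abs(k)

# n-th order covariance matrix with (k, j) entry K_X(k - j).
K = [[cov(k - j) for j in range(n)] for k in range(n)]

# r = (K_X(n), K_X(n-1), ..., K_X(1))^t as in (4.65).
r = [cov(n - k) for k in range(n)]

# K is symmetric, so the coefficients r^t K^{-1} are the solution of K a = r.
a = solve(K, r)

# Mean squared error sigma^2 - r^t K^{-1} r.
mmse = sigma2 - sum(ri * ai for ri, ai in zip(r, a))
print(a, mmse)
```

For this particular covariance function only the most recent sample $X_{n-1}$ receives a nonzero weight, so the predictor uses just the last observation.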



4.9 ⋆ Implications for Linear Estimation

The development of optimal mean squared estimation for the Gaussian
case provides a preview and an approach to the problem of optimal mean
squared estimation for the situation of completely general random vectors
(not necessarily Gaussian) where only linear or affine estimators are allowed
(to avoid the problem of possibly intractable conditional expectations in the
non-Gaussian case). This topic will be developed in some detail in a later
section, but the key results will here be shown to follow directly from the
Gaussian case by reinterpreting the results.

The key fact is that the optimal estimator for a vector Y given a vector
X when the two are jointly Gaussian was found to be an affine estimator,
that is, to have the form

$$\hat{Y}(X) = AX + b.$$

Since it was found that the lowest possible MSE over all possible estimators
was achieved by an estimator of this form with $A = K_{YX}K_X^{-1}$ and
$b = E(Y) - AE(X)$ with a resulting MSE of
$\mathrm{MMSE} = \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY})$,
then it is obviously true that this MMSE must be the minimum achievable
MSE over all affine estimators, i.e., that for all m × k matrices A and
m-dimensional vectors b it is true that

$$\mathrm{MMSE}(A,b) = \mathrm{Tr}\left(E[(Y - AX - b)(Y - AX - b)^t]\right) \ge \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) \tag{4.66}$$

and that equality holds if and only if $A = K_{YX}K_X^{-1}$ and
$b = E(Y) - AE(X)$. We shall now see that this version of the result has
nothing to do with Gaussianity and that the inequality and solution are true
for any distribution (providing of course that $K_X$ is invertible).
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Expanding the MSE and using some linear algebra results in 
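$$\begin{aligned}
\mathrm{MMSE}(A,b) &= \mathrm{Tr}\left(E\left[(Y - AX - b)(Y - AX - b)^t\right]\right) \\
&= \mathrm{Tr}\Big(E\Big[\big((Y - m_Y) - A(X - m_X) - (b - m_Y + Am_X)\big) \\
&\qquad\quad \times \big((Y - m_Y) - A(X - m_X) - (b - m_Y + Am_X)\big)^t\Big]\Big) \\
&= \mathrm{Tr}\left(K_Y - AK_{XY} - K_{YX}A^t + AK_XA^t\right) \\
&\qquad + (b - m_Y + Am_X)^t(b - m_Y + Am_X),
\end{aligned}$$

where all the remaining cross terms are zero. Regardless of A the final term 
is nonnegative and hence it is bounded below by 0, a minimum achieved by 
the choice 

$$b = m_Y - Am_X. \tag{4.67}$$

Thus the inequality we wish to prove becomes 

$$\mathrm{Tr}\left(K_Y - AK_{XY} - K_{YX}A^t + AK_XA^t\right) \ge \mathrm{Tr}\left(K_Y - K_{YX}K_X^{-1}K_{XY}\right) \tag{4.68}$$

or 

$$\mathrm{Tr}\left(K_{YX}K_X^{-1}K_{XY} + AK_XA^t - AK_{XY} - K_{YX}A^t\right) \ge 0. \tag{4.69}$$

Since $K_X$ is a covariance matrix it is Hermitian and nonnegative definite, 
and since it has an inverse, it must be positive definite. Hence it has a 
well defined square root $K_X^{1/2}$ (see Section A.4) and hence the 
left-hand side of (4.69) can be written as 

$$\mathrm{Tr}\left(\left(AK_X^{1/2} - K_{YX}K_X^{-1/2}\right)\left(AK_X^{1/2} - K_{YX}K_X^{-1/2}\right)^t\right) \tag{4.70}$$

(just expand this expression to verify it is the same as the previous 
expression). But this has the form $\mathrm{Tr}(BB^t)$, which is just 
$\sum_{i,j} b_{i,j}^2$, which is nonnegative, proving the inequality. 
Plugging in $A = K_{YX}K_X^{-1}$ achieves the lower bound with equality. 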
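
The expansion requested in the parenthetical remark can be sketched as follows. This is only an illustrative check; it assumes real-valued vectors, uses the symmetry of $K_X^{1/2}$ and $K_X^{-1/2}$, and uses $K_{YX}^t = K_{XY}$.

```latex
% Expanding the product inside the trace of (4.70), assuming K_X^{1/2} is symmetric.
\begin{align*}
\left(AK_X^{1/2} - K_{YX}K_X^{-1/2}\right)\left(AK_X^{1/2} - K_{YX}K_X^{-1/2}\right)^t
 &= \left(AK_X^{1/2} - K_{YX}K_X^{-1/2}\right)\left(K_X^{1/2}A^t - K_X^{-1/2}K_{XY}\right) \\
 &= AK_XA^t - AK_{XY} - K_{YX}A^t + K_{YX}K_X^{-1}K_{XY}.
\end{align*}
% Taking the trace of both sides gives the left-hand side of (4.69).
```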

We summarize the result in the following theorem. 

Theorem 4.6 Given random vectors X and Y with 
$K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))]$, 
$K_Y = E[(Y - m_Y)(Y - m_Y)^t]$, 
$K_X = E[(X - m_X)(X - m_X)^t]$, 
$K_{XY} = E[(X - m_X)(Y - m_Y)^t]$, 
$K_{YX} = E[(Y - m_Y)(X - m_X)^t]$, 
assume that $K_X$ is invertible (e.g., it is positive definite). Then 

$$\min_{A,b} \mathrm{MMSE}(A,b) = \min_{A,b} \mathrm{Tr}\left(E\left[(Y - AX - b)(Y - AX - b)^t\right]\right) = \mathrm{Tr}\left(K_Y - K_{YX}K_X^{-1}K_{XY}\right) \tag{4.71}$$

and the minimum is achieved by $A = K_{YX}K_X^{-1}$ and $b = E(Y) - AE(X)$. 

In particular, this result does not require that the vectors be jointly 
Gaussian. 

As in the Gaussian case, the results can be specialized to the situation 
where $Y = X_n$ and $X = X^n$ and $\{X_n\}$ is a weakly stationary process to 
obtain that the optimal linear estimator of $X_n$ given $(X_0, \ldots, X_{n-1})$ 
in the sense of minimizing the mean squared error is 

$$\hat{X}_n(X^n) = r^t\left(K_X^{(n)}\right)^{-1}X^n, \tag{4.72}$$

where r is the n-dimensional vector 

$$r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}. \tag{4.73}$$

The resulting minimum mean squared error (called the “linear least squares 
error”) is 

$$\mathrm{LLSE} = \sigma_X^2 - r^t\left(K_X^{(n)}\right)^{-1}r \tag{4.74}$$

$$\phantom{\mathrm{LLSE}} = \frac{\det\left(K_X^{(n+1)}\right)}{\det\left(K_X^{(n)}\right)}, \tag{4.75}$$

a classical result of linear estimation theory. Note that the equation with the 
determinant form does not require a Gaussian density, although a Gaussian 
density was used to identify the first form with the determinant form (both 
being the conditional variance $\sigma^2_{X_n|X^n}$ in the Gaussian case). 
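
As a sanity check on the two forms of the error, consider the one-step case $n = 1$, in which $X_1$ is predicted from $X_0$ alone. The worked example below is only an illustration; it assumes a zero-mean weakly stationary process with $K_X(0) = \sigma_X^2 > 0$.

```latex
% One-step linear prediction (n = 1), assuming zero mean and K_X(0) > 0.
\begin{align*}
r &= K_X(1), \qquad K_X^{(1)} = K_X(0), \\
\hat{X}_1(X_0) &= \frac{K_X(1)}{K_X(0)}\, X_0, \\
\mathrm{LLSE} &= K_X(0) - \frac{K_X(1)^2}{K_X(0)}
 = \frac{K_X(0)^2 - K_X(1)^2}{K_X(0)}
 = \frac{\det \begin{pmatrix} K_X(0) & K_X(1) \\ K_X(1) & K_X(0) \end{pmatrix}}{\det \begin{pmatrix} K_X(0) \end{pmatrix}}.
\end{align*}
```

The last expression is the ratio of the determinants of the $2 \times 2$ and $1 \times 1$ covariance matrices of the process, so the two forms agree in this simplest case.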



4.10 Correlation and Linear Estimation 

As an example of the application of correlations, we consider a constrained 
form of the minimum mean squared error estimation problem that provided 
an application and interpretation for conditional expectation. A problem 
with the earlier result is that in some applications the conditional expecta- 
tion will be complicated or unknown, but the simpler correlation might be 
known or at least one can approximate it based on observed data. While 
the conditional expectation provides the optimal estimator over all possible 
estimators, the correlation turns out to provide an optimal estimator over 
a restricted class of estimators. 

Suppose again that the value of X is observed and that a good estimate 
of Y, say $\hat{Y}(X)$, is desired. Once again the quality of an estimator will be 
measured by the resulting mean squared error, but this time we do not 
allow the estimator to be an arbitrary function of the observation; it must 
be a linear function of the form 

$$\hat{Y}(x) = ax + b, \tag{4.76}$$

where a and b are fixed constants which are chosen to minimize the mean 
squared error. Strictly speaking, this is an affine function rather than a 
linear function; it is linear if $b = 0$. The terminology is common, however, 
and we will use it. 

The goal now is to find a and b that minimize 

$$E[(Y - \hat{Y}(X))^2] = E[(Y - aX - b)^2]. \tag{4.77}$$

Rewriting the formula for the error in terms of the mean-removed random 
variables yields for any a, b: 

$$\begin{aligned}
E\left([Y - (aX + b)]^2\right) &= E\left([(Y - EY) - a(X - EX) - (b - EY + aEX)]^2\right) \\
&= \sigma_Y^2 + a^2\sigma_X^2 + (b - EY + aEX)^2 - 2a\,\mathrm{COV}(X,Y)
\end{aligned}$$

since the remaining cross products are all zero (why?). Since the first term 
does not depend on a or b, minimizing the mean squared error is equivalent 
to minimizing 

$$a^2\sigma_X^2 + (b - EY + aEX)^2 - 2a\,\mathrm{COV}(X,Y).$$

First note that the middle term is nonnegative. Once a is chosen, this term 
will be minimized by choosing $b = EY - aEX$, which makes this term 0. 
Thus the best a must minimize $a^2\sigma_X^2 - 2a\,\mathrm{COV}(X,Y)$. A little calculus 
shows that the minimizing a is 

$$a = \frac{\mathrm{COV}(X,Y)}{\sigma_X^2} \tag{4.78}$$

and hence the best b is 

$$b = EY - \frac{\mathrm{COV}(X,Y)}{\sigma_X^2}EX. \tag{4.79}$$
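
The two formulas can be tried out numerically. The following is a minimal sketch, assuming that sample averages over observed data stand in for the expectations $EX$, $EY$, $\sigma_X^2$ and $\mathrm{COV}(X,Y)$; the function name `best_affine_fit` and the data are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def best_affine_fit(xs, ys):
    """Return (a, b) from (4.78) and (4.79), with sample moments
    standing in for EX, EY, COV(X, Y) and the variance of X."""
    mx, my = fmean(xs), fmean(ys)
    cov_xy = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    var_x = fmean([(x - mx) ** 2 for x in xs])
    a = cov_xy / var_x      # (4.78): a = COV(X, Y) / variance of X
    b = my - a * mx         # (4.79): b = EY - a EX
    return a, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = best_affine_fit(xs, ys)
print(a, b)
```

When the data lie exactly on a line, the estimator recovers that line and the mean squared error is zero; for scattered data it returns the least squares line through the sample moments.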

The connection of second order moments and linear estimation also 
plays a fundamental role in the vector analog to the problem of the previous 
section, that is, in the estimation of a vector Y given an observed vector 
X. The details are more complicated, but the basic ideas are essentially 
the same. 

Unfortunately the conditional expectation is mathematically tractable 
only in a few very special cases, e.g., the case of jointly Gaussian vectors. 
In the Gaussian case the conditional expectation given X is formed by a 
simple matrix multiplication on X with possibly a constant vector being 
added; that is, the optimal estimate has a linear form. (As in the scalar case, 
technically this is an affine form and not a linear form if a constant vector is 
added.) Even when the random vectors are not Gaussian, linear predictors 
or estimates are important because of their simplicity. Although they are 
not in general optimal, they play an important role in signal processing. 
Hence we next turn to the problem of finding the optimal linear estimate 
of one vector given another. 

Suppose as before that we are given a k-dimensional vector X and 
wish to predict an m-dimensional vector Y. We now restrict ourselves to 
estimates of the form 

$$\hat{Y} = AX,$$

where the $m \times k$ matrix A can be considered as a matrix of 
k-dimensional row vectors $a_i^t$; $i = 0, \ldots, m-1$: 

$$A = [a_0, a_1, \ldots, a_{m-1}]^t$$

so that if $\hat{Y} = (\hat{Y}_0, \ldots, \hat{Y}_{m-1})^t$, then 

$$\hat{Y}_i = a_i^tX$$

and hence 

$$\epsilon^2(\hat{Y}) = \sum_{i=0}^{m-1} E[(Y_i - a_i^tX)^2]. \tag{4.80}$$

The goal is to find the matrix A that minimizes $\epsilon^2$, which can be 
considered as a function of the estimate $\hat{Y}$ or of the matrix A defining the 
estimate. We shall provide two separate solutions which are almost, but 
not quite, equivalent. The first is constructive in nature: a specific A will 
be given and shown to be optimal. The second development is descriptive: 
without actually giving the matrix A, we will show that a certain property 
is necessary and sufficient for the matrix to be optimal. That property is 
called the orthogonality principle, and it states that the optimal matrix is 
the one that causes the error vector $Y - \hat{Y}$ to be orthogonal to (have zero 
correlation with) the observed vector X. The first development is easier to 
use because it provides a formula for A that can be immediately computed 
in many cases. The second development is less direct and less immediately 
applicable, but it turns out to be more general: the descriptive property can 
be used to derive A even when the first development is not applicable. The 
orthogonality principle plays a fundamental role in all of linear estimation 
theory. 

The error $\epsilon^2(A)$ is minimized if each term $E[(Y_i - a_i^tX)^2]$ is minimized 
over $a_i$ since there is no interaction among the terms in the sum. We can do 
no better when minimizing a sum of such positive terms than to minimize 
each term separately. Thus the fundamental problem is the following 
simpler one: Given a random vector X and a random variable (one-dimensional 
or scalar vector) Y, we seek a vector a that minimizes 

$$\epsilon^2(a) = E[(Y - a^tX)^2]. \tag{4.81}$$

One way to find the optimal a is to use calculus, setting derivatives of 
$\epsilon^2(a)$ to zero and verifying that the stationary point so obtained is a global 
minimum. As previously discussed, variational techniques can be avoided 
via elementary inequalities if the answer is known. We shall show that the 
optimal a is a solution of 

$$a^tR_X = E(YX^t), \tag{4.82}$$

so that if the autocorrelation matrix defined by 

$$R_X = E[XX^t] = \{R_X(i,j) = E(X_iX_j);\ i, j = 0, \ldots, k-1\}$$

is invertible, then the optimal a is given by 

$$a^t = E(YX^t)R_X^{-1}. \tag{4.83}$$

To prove this we assume that a satisfies (4.83) and show that for any other 
vector b 

$$\epsilon^2(b) \ge \epsilon^2(a). \tag{4.84}$$

To do this we write 

$$\begin{aligned}
\epsilon^2(b) &= E[(Y - b^tX)^2] = E[(Y - a^tX + a^tX - b^tX)^2] \\
&= E[(Y - a^tX)^2] + 2E[(Y - a^tX)(a^tX - b^tX)] + E[(a^tX - b^tX)^2].
\end{aligned}$$

Of the final terms, the first term is just $\epsilon^2(a)$ and the rightmost term is 
obviously nonnegative. Thus we have the bound 

$$\epsilon^2(b) \ge \epsilon^2(a) + 2E[(Y - a^tX)(a^t - b^t)X]. \tag{4.85}$$

The crossproduct term can be written as 

$$\begin{aligned}
2E[(Y - a^tX)(a^t - b^t)X] &= 2E[(Y - a^tX)X^t(a - b)] \\
&= 2E[(Y - a^tX)X^t](a - b) \\
&= 2\left(E[YX^t] - a^tE[XX^t]\right)(a - b) \\
&= 2\left(E[YX^t] - a^tR_X\right)(a - b) \\
&= 0
\end{aligned} \tag{4.86}$$

invoking (4.82). Combining this with (4.85) proves (4.84) and hence 
optimality. Note that because of the symmetry of autocorrelation matrices and 
their inverses, we can rewrite (4.83) as 

$$a = R_X^{-1}E[YX]. \tag{4.87}$$

Using the above result to perform a termwise minimization of (4.80) now 
yields the following theorem describing the optimal linear vector predictor. 



Theorem 4.7 The minimum mean squared error linear predictor of the 
form $\hat{Y} = AX$ is given by any solution A of the equation: 

$$AR_X = E(YX^t).$$

If the matrix $R_X$ is invertible, then A is uniquely given by 

$$A^t = R_X^{-1}E[XY^t],$$

that is, the matrix A has rows $a_i^t$; $i = 0, 1, \ldots, m-1$, with 

$$a_i = R_X^{-1}E[Y_iX].$$

Alternatively, 

$$A = E[YX^t]R_X^{-1}. \tag{4.88}$$

Having found the best linear estimate, it is easy to modify the develop- 
ment to find the best estimate of the form 

$$\hat{Y}(X) = AX + b, \tag{4.89}$$

where now we allow an additional constant term. This is also often called 
a linear estimate, although as previously noted it is more correctly called 
an affine estimate because of the extra constant vector term. As the end 
result and proof strongly resemble the linear estimate result, we proceed 
directly to the theorem. 

Theorem 4.8 The minimum mean squared error estimate of the form 
$\hat{Y} = AX + b$ is given by any solution A of the equation: 

$$AK_X = E[(Y - E(Y))(X - E(X))^t] \tag{4.90}$$

where the covariance matrix $K_X$ is defined by 

$$K_X = E[(X - E(X))(X - E(X))^t] = R_{X - E(X)},$$

and 

$$b = E(Y) - AE(X).$$

If $K_X$ is invertible, then 

$$A = E[(Y - E(Y))(X - E(X))^t]K_X^{-1}. \tag{4.91}$$

Note that if X and Y have zero means, then the result reduces to the 
previous result; that is, affine predictors offer no advantage over linear 
predictors for zero mean random vectors. To prove the theorem, let C be 
any matrix and d any vector (both of suitable dimensions) and note that 

$$\begin{aligned}
E\left(\|Y - (CX + d)\|^2\right) &= E\left(\|(Y - E(Y)) - C(X - E(X)) + E(Y) - CE(X) - d\|^2\right) \\
&= E\left(\|(Y - E(Y)) - C(X - E(X))\|^2\right) \\
&\quad + E\left(\|E(Y) - CE(X) - d\|^2\right) \\
&\quad + 2E[Y - E(Y) - C(X - E(X))]^t[E(Y) - CE(X) - d].
\end{aligned}$$

From Theorem 4.7, the first term is minimized by choosing C = A, 
where A is a solution of (4.90); also, the second term is the expectation of 
the squared norm of a vector that is identically zero if C = A and d = b, 
and similarly for this choice of C and d the third term is zero. Thus 

$$E\left(\|Y - (CX + d)\|^2\right) \ge E\left(\|Y - (AX + b)\|^2\right).$$

□ 

We often restrict interest to linear estimates by assuming that the var- 
ious vectors have zero mean. This is not always possible, however. For 
example, groups of pixels in a sampled image intensity raster can be used 
to predict other pixel groups, but pixel values are always nonnegative and 
hence always have nonzero means. Hence in some problems affine predictors 
may be preferable. Nonetheless, we will often follow the common practice 
of focusing on the linear case and extending when necessary. In most stud- 
ies of linear prediction it is assumed that the mean is zero, i.e., that any 
dc value of the process has been removed. If this assumption is not made, 
linear estimation theory is still applicable but will generally give inferior 
performance to the use of affine prediction. 

The Orthogonality Principle 

Although we have proved the form of the optimal linear predictor of one 
vector given another, there is another way to describe the result that is 
often useful for deriving optimal linear predictors in somewhat different 
situations. To develop this alternative viewpoint we focus on the error 
vector 

$$e = Y - \hat{Y}. \tag{4.92}$$

Rewriting (4.92) as $Y = \hat{Y} + e$ points out that the vector Y can be 
considered as its estimate plus an error or “noise” term. The goal of an optimal 
predictor is then to minimize the error energy $e^te = \sum_{n=0}^{m-1} e_n^2$. 
If the estimate is linear, then 

$$e = Y - AX.$$

As with the basic development for the linear predictor, we simplify 
things for the moment and look at the scalar prediction problem of 
predicting a random variable Y by $\hat{Y} = a^tX$, yielding a scalar error of 
$e = Y - \hat{Y} = Y - a^tX$. Since we have seen that the overall mean squared error 
$E[e^te]$ in the vector case is minimized by separately minimizing each 
component $E[e_k^2]$, we can later easily extend our results for the scalar case to 
the vector case. 

Suppose that a is chosen optimally and consider the crosscorrelation 
between an arbitrary error term and the observable vector: 

$$\begin{aligned}
E[(Y - \hat{Y})X] &= E[(Y - a^tX)X] \\
&= E[YX] - E[X(X^ta)] \\
&= E[YX] - R_Xa = 0
\end{aligned}$$

using (4.82). 

Thus for the optimal predictor, the error satisfies 

$$E[eX] = 0,$$

or, equivalently, 

$$E[eX_n] = 0;\ n = 0, \ldots, k-1. \tag{4.93}$$

When two random variables e and X are such that their expected 
product E(eX) is 0, they are said to be orthogonal and we write 

$$e \perp X.$$

We have therefore shown that the optimal linear estimate of a scalar random 
variable given a vector of observations causes the error to be orthogonal to 
all of the observables and hence orthogonality of error and observations is 
a necessary condition for optimality of a linear estimate. 

Conversely, suppose that we know a linear estimate a is such that it 
renders the prediction error orthogonal to all of the observations. Arguing 
as we have before, suppose that b is any other linear predictor vector and 
observe that 

$$\begin{aligned}
\epsilon^2(b) &= E[(Y - b^tX)^2] \\
&= E[(Y - a^tX + a^tX - b^tX)^2] \\
&\ge \epsilon^2(a) + 2E[(Y - a^tX)(a^tX - b^tX)],
\end{aligned}$$

where the equality holds if b = a. Letting $e = Y - a^tX$ denote the error 
resulting from an a that makes the error orthogonal with the observations, 
the rightmost term can be rewritten as 

$$2E[e(a^tX - b^tX)] = 2(a^t - b^t)E[eX] = 0.$$

Thus we have shown that $\epsilon^2(b) \ge \epsilon^2(a)$ and hence no linear estimate can 
outperform one yielding an error orthogonal to the observations and hence 
such orthogonality is sufficient as well as necessary for optimality. 

Since the optimal estimate of a vector Y given X is given by the com- 
ponentwise optimal predictions given X, we have thus proved the following 
alternative to Theorem 4.7. 

Theorem 4.9 The Orthogonality Principle: 

A linear estimate $\hat{Y} = AX$ is optimal (in the mean squared 
error sense) if and only if the resulting errors are orthogonal to the 
observations, that is, if $e = Y - AX$, then 

$$E[e_kX_n] = 0;\ k = 1, \ldots, K;\ n = 1, \ldots, N.$$
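
The orthogonality principle is easy to check numerically. The sketch below is only an illustration under stated assumptions: the observation X is two-dimensional, expectations are replaced by sample averages over hypothetical data, and the helper name `optimal_linear_coeffs` is not from the text. It solves $R_Xa = E[YX]$ directly and then confirms that the resulting error is orthogonal to each observation.

```python
from statistics import fmean


def optimal_linear_coeffs(x0, x1, y):
    """Solve R_X a = E[Y X] for a two-dimensional observation X = (X0, X1).

    Expectations are replaced by sample averages over the given lists.
    """
    r00 = fmean([u * u for u in x0])
    r01 = fmean([u * v for u, v in zip(x0, x1)])
    r11 = fmean([v * v for v in x1])
    p0 = fmean([t * u for t, u in zip(y, x0)])
    p1 = fmean([t * v for t, v in zip(y, x1)])
    det = r00 * r11 - r01 * r01     # R_X must be invertible
    a0 = (r11 * p0 - r01 * p1) / det
    a1 = (r00 * p1 - r01 * p0) / det
    return a0, a1


# Hypothetical samples of (X0, X1, Y).
x0 = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
x1 = [0.0, 1.0, 2.0, 1.0, -1.0, 2.0]
y = [1.5, 2.0, 3.5, 0.0, 2.5, 4.0]

a0, a1 = optimal_linear_coeffs(x0, x1, y)
errors = [t - a0 * u - a1 * v for t, u, v in zip(y, x0, x1)]

# Sample versions of E[e X_0] and E[e X_1]; both should vanish up to rounding.
corr0 = fmean([e * u for e, u in zip(errors, x0)])
corr1 = fmean([e * v for e, v in zip(errors, x1)])
print(corr0, corr1)
```

Perturbing either coefficient away from the solution makes at least one of the two correlations nonzero and increases the average squared error, which is the content of the theorem in this small case.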

4.11 Correlation and Covariance Functions 

We turn now to correlation in the framework of random processes. The 
notion of an iid random process can be generalized by specifying the com- 
ponent random variables to be merely uncorrelated rather than indepen- 
dent. Although requiring the random process to be uncorrelated is a much 
weaker requirement, the specification is sufficient for many applications, as 
will be seen in several ways. In particular, in this chapter, the basic laws 
of large numbers require only the weaker assumption and hence are more 
general than they would be if independence were required. To define the 
class of uncorrelated processes, it is convenient to introduce the notions of 
autocorrelation functions and covariance functions of random processes. 

Given a random process $\{X_t; t \in \mathcal{T}\}$, the autocorrelation function 
$R_X(t,s)$; $t, s \in \mathcal{T}$ is defined by 

$$R_X(t,s) = E(X_tX_s);\ \text{all } t, s \in \mathcal{T}.$$

The autocovariance function or simply the covariance function $K_X(t,s)$; 
$t, s \in \mathcal{T}$ is defined by 

$$K_X(t,s) = \mathrm{COV}(X_t, X_s).$$

Observe that (4.19) relates the two functions by 

$$K_X(t,s) = R_X(t,s) - (EX_t)(EX_s). \tag{4.94}$$



Thus the autocorrelation and covariance functions are equal if the process 
has zero mean, that is, if EXt = 0 for all t. The covariance function of a 
process {Xt} can be viewed as the autocorrelation function of the process 
{Xt — EXt} formed by removing the mean from the given process to form 
a new process having zero mean. 

The autocorrelation function of a random process is given by the cor- 
relation of all possible pairs of samples; the covariance function is the co- 
variance of all possible pairs of samples. Both functions provide a measure 
of how dependent the samples are and will be seen to play a crucial role 
in laws of large numbers. Note that both definitions are valid for random 
processes in either discrete time or continuous time and having either a 
discrete alphabet or a continuous alphabet. 

In terms of the correlation function, a random process $\{X_t; t \in \mathcal{T}\}$ is 
said to be uncorrelated if 

$$R_X(t,s) = \begin{cases} E(X_t^2) & \text{if } t = s \\ EX_tEX_s & \text{if } t \ne s \end{cases}$$

or, equivalently, if 

$$K_X(t,s) = \begin{cases} \sigma_{X_t}^2 & \text{if } t = s \\ 0 & \text{if } t \ne s. \end{cases}$$

The reader should not overlook the obvious fact that if a process is iid 
or uncorrelated, the random variables are independent or uncorrelated only 
if taken at different times. That is, $X_t$ and $X_s$ will not be independent 
or uncorrelated when t = s, only when $t \ne s$ (except, of course, in such 
trivial cases as that where $\{X_t\} = \{a_t\}$, a sequence of constants where 
$E(X_tX_t) = a_ta_t = EX_tEX_t$ and hence $X_t$ is uncorrelated with itself). 

Gaussian Processes Revisited 

Recall from chapter 3 that a Gaussian random process $\{X_t; t \in \mathcal{T}\}$ is 
completely described by a mean function $\{m_t; t \in \mathcal{T}\}$ and a covariance 
function $\{\Lambda(t,s); t, s \in \mathcal{T}\}$. As one might suspect, the names of these 
functions come from the fact that they are indeed the mean and covariance 
functions as defined in terms of expectations, i.e., 

$$m_t = EX_t \tag{4.95}$$

$$\Lambda(t,s) = K_X(t,s). \tag{4.96}$$

The result for the mean follows immediately from our computation of the 
mean of a Gaussian $N(m, \sigma^2)$ random variable. The result for the 
covariance can be derived by brute force integration (not too bad if the integrator 
is well versed in matrix transformations of multidimensional integrals) or 
looked up in tables somewhere. The computation is tedious and we will 
simply state the result without proof. The multidimensional characteristic 
functions to be introduced later can be used to give a relatively simple proof, 
but again it is not worth the effort to fill in the details. 

A more important issue is the properties that were required for a 
covariance function when the Gaussian process was defined. Recall that it 
was required that the covariance function of the process be symmetric, i.e., 
$K_X(t,s) = K_X(s,t)$, and positive definite, i.e., given any positive integer 
k, any collection of sample times $\{t_0, \ldots, t_{k-1}\}$, and any k real numbers 
$a_i$; $i = 0, \ldots, k-1$ (not all 0), then 

$$\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_ia_lK_X(t_i,t_l) > 0. \tag{4.97}$$

We now return to these conditions to see if they are indeed necessary con- 
ditions for all covariance functions, Gaussian or not. 

Symmetry is easy. It immediately follows from the definitions that 

$$\begin{aligned}
K_X(t,s) &= E[(X_t - EX_t)(X_s - EX_s)] \\
&= E[(X_s - EX_s)(X_t - EX_t)] \\
&= K_X(s,t)
\end{aligned} \tag{4.98}$$

and hence clearly all covariance functions are symmetric, and so are 
covariance matrices formed by sampling covariance functions. To see that 
positive definiteness is indeed almost a requirement, consider the fact that 

$$\begin{aligned}
\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_ia_lK_X(t_i,t_l) &= \sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_ia_lE[(X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})] \\
&= E\left(\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_ia_l(X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})\right) \\
&= E\left[\left|\sum_{i=0}^{k-1} a_i(X_{t_i} - EX_{t_i})\right|^2\right] \\
&\ge 0.
\end{aligned} \tag{4.99}$$



Thus any covariance function $K_X$ must at least be nonnegative definite, 
which implies that any covariance matrix formed by sampling the 
covariance function must also be nonnegative definite. Thus nonnegative 
definiteness is necessary for a covariance function and our requirement for a 
Gaussian process was only slightly stronger than what was needed. We will 
later see how to define a Gaussian process when the covariance function is 
only nonnegative definite and not necessarily positive definite. 
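
The identity behind (4.99) can be observed on data. The following sketch is illustrative only: it assumes sample averages in place of expectations, and the helper names and data are hypothetical. It checks that the quadratic form equals the variance of the weighted sum (and so cannot be negative), and that it can be exactly zero when one variable is a linear function of another, which is why nonnegative rather than positive definiteness is the general property.

```python
from statistics import fmean


def sample_covariance_matrix(columns):
    """Covariance matrix of k variables, each given as a list of samples."""
    means = [fmean(c) for c in columns]
    k = len(columns)
    return [[fmean([(u - means[i]) * (v - means[l])
                    for u, v in zip(columns[i], columns[l])])
             for l in range(k)] for i in range(k)]


def quadratic_form(K, a):
    """Compute the double sum of a_i * a_l * K[i][l] over all i and l."""
    n = len(a)
    return sum(a[i] * a[l] * K[i][l] for i in range(n) for l in range(n))


# Hypothetical samples; the third variable is exactly twice the first.
columns = [[1.0, 2.0, 4.0, 3.0],
           [0.0, 1.0, 1.0, 2.0],
           [2.0, 4.0, 8.0, 6.0]]
K = sample_covariance_matrix(columns)

a = [1.0, -3.0, 0.5]
q = quadratic_form(K, a)

# Variance of the weighted sum of the variables, computed directly.
combo = [sum(a[i] * columns[i][n] for i in range(3)) for n in range(4)]
m = fmean(combo)
var_combo = fmean([(c - m) ** 2 for c in combo])

# A weight vector that cancels the linear dependence gives a zero form.
q_degenerate = quadratic_form(K, [2.0, 0.0, -1.0])
print(q, var_combo, q_degenerate)
```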

A slight variation on the above argument shows that if 
$X = (X_0, \ldots, X_{k-1})^t$ is any random vector, then the covariance matrix 
$\Lambda = \{\lambda_{i,l};\ i, l \in \mathcal{Z}_k\}$ defined by 
$\lambda_{i,l} = E[(X_i - EX_i)(X_l - EX_l)]$ must also be symmetric and 
nonnegative definite. This was the reason for assuming that the covariance 
matrix for a Gaussian random vector had at least these properties. 

We make two important observations before proceeding. First, remem- 
ber that the four basic properties of expectation have nothing to do with 
independence. In particular, whether or not the random variables involved 
are independent or uncorrelated, one can always interchange the expecta- 
tion operation and the summation operation (property 3), because expec- 
tation is linear. On the other hand, one cannot interchange the expectation 
operation with the product operation (this is not a property of expectation) 
unless the random variables involved are uncorrelated, e.g., when they are 
independent. Second, an iid process is also a discrete time uncorrelated 
random process with identical marginal distributions. The converse state- 
ment is not true in general; that is, the notion of an uncorrelated process is 
more general than that of an iid process. Correlation measures only a weak 
pairwise degree of independence. A random process could even be pairwise 
independent (and hence uncorrelated) but still not be iid (problem 4.28). 

4.12 ⋆The Central Limit Theorem 



The characteristic function of a sum of iid Gaussian random variables has 
been shown to also be Gaussian and linear combinations of jointly Gaussian 
variables have also been shown to be Gaussian. Far more surprising is that the 
characteristic function of the sum of many non-Gaussian random variables 
turns out to be approximately Gaussian if the variables are suitably scaled 
and shifted. This result is called the central limit theorem and is one of 
the primary reasons for the importance of Gaussian distributions. When a 
large number of effects are added up with suitable scaling and shifting, the 
resulting random variable looks Gaussian even if the underlying individual 
effects are not at all Gaussian. This result is developed in this section. 

Just as with laws of large numbers, there is no single central limit the- 
orem — there are many versions of central limit theorems. The various 
central limit theorems differ in the conditions of applicability. However, 
they have a common conclusion: the distribution or characteristic function 
of the sum of a collection of random variables converges to that of a Gaus- 
sian random variable. We will present only the simplest form of central 
limit theorem, a central limit theorem for iid random variables. 

Suppose that $\{X_n\}$ is an iid random process with a common distribution 
$F_X$, described by a pmf or pdf, that is arbitrary except that it has a finite 
mean $EX_n = m$ and finite variance $\sigma_{X_n}^2 = \sigma^2$. It will also be 
assumed that the characteristic function $M_X(ju)$ is well behaved for small u 
in a manner to be made precise. 
Consider the “standardized” or “normalized” sum 

$$R_n = \frac{1}{n^{1/2}}\sum_{i=0}^{n-1}\frac{X_i - m}{\sigma}. \tag{4.100}$$

By subtracting the means and dividing by the square root of the variance 
(the standard deviation), the resulting random variable is easily seen to have 
zero mean and unit variance; that is, 

$$ER_n = 0;\quad \sigma_{R_n}^2 = 1,$$

hence the description “standardized,” or “normalized.” Note that unlike 
the sample average that appears in the law of large numbers, the sum here 
is normalized by $n^{-1/2}$ and not $n^{-1}$. 

Using characteristic functions, we have from the independence of the 
$\{X_i\}$ and lemma 4.1 that 

$$M_{R_n}(ju) = \left[M_{(X-m)/\sigma}\left(\frac{ju}{n^{1/2}}\right)\right]^n. \tag{4.101}$$

We wish to investigate the asymptotic behavior of the characteristic 
function of (4.101) as $n \to \infty$. This is accomplished by assuming that 
$\sigma^2$ is finite and applying the approximation of (4.16) to 
$M_{(X-m)/\sigma}(ju/n^{1/2})$ 
and then finding the limiting behavior of the expression. Let 
$Y = (X - m)/\sigma$. Y has zero mean and a second moment of 1, and hence from (4.16) 

$$M_{(X-m)/\sigma}\left(\frac{ju}{n^{1/2}}\right) = 1 - \frac{u^2}{2n} + o(u^2/n), \tag{4.102}$$

where the rightmost term goes to zero faster than $u^2/n$. Combining this 
result with (4.101) produces 

$$\lim_{n\to\infty} M_{R_n}(ju) = \lim_{n\to\infty}\left[1 - \frac{u^2}{2n} + o(u^2/n)\right]^n.$$

From elementary real analysis, however, this limit is 

$$\lim_{n\to\infty} M_{R_n}(ju) = e^{-u^2/2},$$



the characteristic function of a Gaussian random variable with zero mean 
and unit variance! Thus, provided that (4.102) holds, a standardized sum 
of a family of iid random variables has a transform that converges to the 
transform of a Gaussian random variable regardless of the actual marginal 
distribution of the iid sequence. 

By taking inverse transforms, the convergence of transforms implies that 
the cdf’s will also converge to a Gaussian cdf (provided some technical con- 
ditions are satisfied to ensure that the operations of limits and integration 
can be exchanged). This does not imply convergence to a Gaussian pdf, 
however, because, for example, a finite sum of discrete random variables 
will not have a pdf (unless one resorts to Dirac delta functions). Given 
a sequence of random variables $R_n$ with cdf $F_n$ and a random variable R 
with distribution F, then if $\lim_{n\to\infty} F_n(r) = F(r)$ for all real r, we say 
that $R_n$ converges to R in distribution. Thus the central limit theorem 
states that under certain conditions, sums of iid random variables adjusted 
to have zero mean and unit variance converge in distribution to a Gaussian 
random variable with the same mean and variance. 

A slight modification of the above development shows that if $\{X_n\}$ is 
an iid sequence with mean m and variance $\sigma^2$, then 

$$n^{-1/2}\sum_{k=0}^{n-1}(X_k - m)$$

will have a transform and a cdf converging to those of a Gaussian random 
variable with mean 0 and variance $\sigma^2$. We summarize the central limit 
theorem that we have established as follows. 

Theorem 4.10 (A Central Limit Theorem). Let $\{X_n\}$ be an iid 
random process with a finite mean m and variance $\sigma^2$. Then 

$$n^{-1/2}\sum_{k=0}^{n-1}(X_k - m)$$

converges in distribution to a Gaussian random variable with mean 0 and 
variance $\sigma^2$. 

Intuitively the theorem states that if we sum up a large number of 
independent random variables and normalize by $n^{-1/2}$ so that the variance 
of the normalized sum stays constant, then the resulting sum will be 
approximately Gaussian. For example, a current meter across a resistor will 
measure the effects of the sum of millions of electrons randomly moving 
and colliding with each other. Regardless of the probabilistic description of 
these micro-events, the global current will appear to be Gaussian. Making 
this precise yields a model of thermal noise in resistors. Similarly, if dust 
particles are suspended on a dish of water and subjected to the random 
collisions of millions of molecules, then the motion of any individual par- 
ticle in two dimensions will appear to be Gaussian. Making this rigorous 
yields the classic model for what is called “Brownian motion.” A similar 
development in one dimension yields the Wiener process. 
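
The convergence can be observed in a small simulation. The following is a rough sketch rather than a proof: it assumes iid samples uniform on (0, 1), for which m = 1/2 and the variance is 1/12, forms the standardized sum (4.100) with n = 30, and compares a few summary statistics with those of a zero-mean, unit-variance Gaussian. The function name `standardized_sum`, the sample sizes, and the seed are arbitrary illustrative choices.

```python
import random
from statistics import fmean, pvariance


def standardized_sum(n, rng):
    """R_n of (4.100) for iid uniform(0, 1) samples (mean 1/2, variance 1/12)."""
    m, sigma = 0.5, (1.0 / 12.0) ** 0.5
    return sum((rng.random() - m) / sigma for _ in range(n)) / n ** 0.5


rng = random.Random(0)   # fixed seed so that the run is repeatable
samples = [standardized_sum(30, rng) for _ in range(20000)]

mean_r = fmean(samples)
var_r = pvariance(samples)
# For a standard Gaussian, about 68.3 percent of the mass lies in [-1, 1].
frac_within_one = sum(abs(r) <= 1.0 for r in samples) / len(samples)
print(mean_r, var_r, frac_within_one)
```

The empirical mean and variance should be close to 0 and 1, and the fraction of samples within one standard deviation should be close to the Gaussian value, even though each individual term is uniform and not Gaussian.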

Note that in (4.101), if the Gaussian characteristic function is substi- 
tuted on the right-hand side, a Gaussian characteristic function appears on 
the left. Thus the central limit theorem says that if you sum up random 
variables, you approach a Gaussian distribution. Once you have a Gaussian 
distribution, you “get stuck” there — adding more random variables of the 
same type (or Gaussian random variables) to the sum does not change the 
Gaussian characteristic. The Gaussian distribution is an example of an 
infinitely divisible distribution. The nth root of its characteristic function is 
the characteristic function of a distribution of the same type, as seen in 
(4.101). Equivalently stated, the distribution class is invariant under summations. 



4.13 Sample Averages 

In many applications, engineers analyze the accuracy of estimates, the prob- 
ability of detector error, etc., as a function of the amount of data available. 
This and the next sections are a prelude to such analyses. They also pro- 
vide some very good practice manipulating expectations and a few results 
of interest in their own right. 

In this section we study the behavior of the arithmetic average of the 
first n values of a discrete time random process with either a discrete or a 
continuous alphabet. Specifically, the variance of the average is considered 
as a function of n. 

Suppose we are given a process $\{X_n\}$. The sample average of the first 
n values of $\{X_n\}$ is $S_n = n^{-1}\sum_{i=0}^{n-1} X_i$. The mean of $S_n$ is 
found easily using the linearity of expectation (expectation property 3) as 

$$ES_n = E\left[n^{-1}\sum_{i=0}^{n-1} X_i\right] = n^{-1}\sum_{i=0}^{n-1} EX_i. \tag{4.103}$$



Hence the mean of the sample average is the same as the average of the 
mean of the random variables produced by the process. Suppose now that 
we assume that the mean of the random variables is a constant, $EX_i = \bar{X}$ 
independent of i. Then $ES_n = \bar{X}$. In terms of estimation theory, if one 
estimates an unknown random process mean, $\bar{X}$, by $S_n$, then the estimate 
is said to be unbiased because the expected value of the estimate is equal to 
the value being estimated. Obviously an unbiased estimate is not unique, so 
being unbiased is only one desirable characteristic of an estimate (problem 
4.25). 

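Next consider the variance of the sample average: 

$$\begin{aligned}
\sigma_{S_n}^2 &= E[(S_n - E(S_n))^2] \\
&= E\left[\left(n^{-1}\sum_{i=0}^{n-1} X_i - n^{-1}\sum_{i=0}^{n-1} EX_i\right)^2\right] \\
&= E\left[\left(n^{-1}\sum_{i=0}^{n-1}(X_i - EX_i)\right)^2\right] \\
&= n^{-2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} E[(X_i - EX_i)(X_j - EX_j)].
\end{aligned}$$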



The reader should be certain that the preceding operations are well under- 
stood, as they are frequently encountered in analyses. Note that expanding 
the square requires the use of separate dummy indices in order to get all 
of the cross products. Once expanded, linearity of expectation permits the 
interchange of expectation and summation. 

Recognizing the expectation in the sum as the covariance function, the 
variance of the sample average becomes 



$$\sigma_{S_n}^2 = n^{-2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} K_X(i,j). \tag{4.104}$$

Note that so far we have used none of the specific knowledge of the 
process, i.e., the above formula holds for general discrete time processes and 
does not require such assumptions as time-constant mean, time-constant 
variance, identical marginal distributions, independence, uncorrelated 
processes, etc. If we now use the assumption that the process is uncorrelated, 
the covariance becomes zero except when i = j, and expression (4.104) 
becomes 

$$\sigma_{S_n}^2 = n^{-2}\sum_{i=0}^{n-1}\sigma_{X_i}^2. \tag{4.105}$$

If we now also assume that the variances $\sigma_{X_i}^2$ are equal to some constant 
value $\sigma_X^2$ for all times i, e.g., the process has identical marginal distributions 
as for an iid process, then the equations become 

$$\sigma_{S_n}^2 = n^{-1}\sigma_X^2. \tag{4.106}$$

Thus, for uncorrelated discrete time random processes with mean and 
variance not depending on time, the sample average has expectation equal 
to the (time-constant) mean of the process, and the variance of the sample 
average tends to zero as $n \to \infty$. Of course we have only specified sufficient 
conditions. Expression (4.104) goes to zero with n under more general 
circumstances, as we shall see later. 

For now, however, we stick with uncorrelated processes with mean and 
variance independent of time and require only a definition to obtain our 
first law of large numbers, a result implicit in equation (4.106). 
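
Formula (4.104) can be evaluated directly for any given covariance function. The sketch below is a minimal illustration with hypothetical function names and parameter values: it evaluates the double sum for an uncorrelated process with constant variance, where the result should reduce to (4.106), and for an assumed geometrically decaying covariance, chosen here only as an example of a correlated process for which the variance of the sample average still shrinks with n.

```python
def sample_average_variance(K, n):
    """Evaluate (4.104): n**-2 times the double sum of K(i, j), 0 <= i, j < n."""
    return sum(K(i, j) for i in range(n) for j in range(n)) / n ** 2


def K_uncorrelated(i, j, sigma2=4.0):
    """Covariance function of an uncorrelated process with constant variance."""
    return sigma2 if i == j else 0.0


def K_geometric(i, j, sigma2=4.0, rho=0.5):
    """Hypothetical correlated example: K(i, j) = sigma2 * rho**|i - j|."""
    return sigma2 * rho ** abs(i - j)


for n in (1, 10, 100):
    print(n,
          sample_average_variance(K_uncorrelated, n),
          sample_average_variance(K_geometric, n))
```

In the uncorrelated case the printed values follow the variance divided by n exactly; in the correlated case they are larger for each n greater than 1 but still decrease as n grows.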



4.14 Convergence of Random Variables 

The preceding section demonstrated a form of convergence for the sequence 
of random variables, $\{S_n\}$, the sequence of sample averages, that is different 
from convergence as it is seen for a nonrandom sequence. To review, a 
nonrandom sequence $\{x_n\}$ is said to converge to the limit x if for every 
$\epsilon > 0$ there exists an N such that $|x_n - x| < \epsilon$ for every n > N. The 
preceding section did not see $S_n$ converge in this sense. Nothing was said about 
the individual realizations $S_n(\omega)$ as a function of $\omega$. Only the variance of the 
sequence $\sigma_{S_n}^2$ was shown to converge to zero in the usual $\epsilon$, N sense. The 
variance calculation probabilistically averages across $\omega$. For any particular 
$\omega$, the realization $S_n(\omega)$ may, in fact, not converge to the process mean. 

Thus, in order to make precise the notion of convergence of sample 
averages to a limit, we need to make precise the notion of convergence of a 
sequence of random variables. In this section we will describe four notions
of convergence of random variables. These are perhaps the most commonly
encountered, but they are by no means an exhaustive list. The common
goal is to quantify a useful definition for saying that a sequence of random
variables, say $Y_n$, $n = 1, 2, \ldots$, converges to a random variable $Y$, which
will be considered the limit of the sequence. Our main application will
be the case where $Y_n = S_n$, a sample average of $n$ samples of a random
process, and $Y$ is the expectation of the samples, that is, the limit is a
trivial random variable, a constant.

The most straightforward generalization of the usual idea of a limit to 
random variables is easy to define, but virtually useless. If for every sample 
point $\omega$ we had $\lim_{n\to\infty} Y_n(\omega) = Y(\omega)$ in the usual sense of convergence of
numbers, then we could say that $Y_n$ converges pointwise to $Y$, that is, for
every sample point in the sample space. Unfortunately it is rarely possible 
to prove so strong a result, nor is it necessary. 

A slight variation of this yields a far more important notion
of convergence. A sequence of random variables $Y_n$, $n = 1, 2, \ldots$, is said
to converge to a random variable $Y$ with probability one, or to converge
w.p. 1, if the set of sample points $\omega$ such that $\lim_{n\to\infty} Y_n(\omega) = Y(\omega)$ is
an event with probability one. Thus a sequence converges with probability
one if it converges pointwise on a set of probability one; it can do anything
outside of that set, e.g., converge to something else or not converge at
all. Since the total probability of all such bad sequences is 0, this has no
practical significance. Although the easiest useful concept of convergence
to define, it is the most difficult to work with, and most proofs involving
convergence with probability one are far beyond the mathematical prerequisites
and capabilities of this course. Hence we will focus on two other notions
of convergence that are perhaps less intuitive to understand, but are far 
easier to use when proving results. First note, however, that there are 
many equivalent names for convergence with probability one. It is often 
called convergence almost surely and abbreviated a.s. or convergence almost 
everywhere and abbreviated a.e. Convergence with probability one will not 
be considered in any depth here, but some toy examples will be considered 
in the problems to help get the concept across. 

Henceforth two definitions of convergence of random variables will be 
emphasized, both well suited to the type of results developed here (and 
one that is used in the first such results, Bernoulli’s weak law of large 
numbers for iid random processes) . The first is convergence in mean square, 
convergence of the type seen in the last section, which leads to a result called 
a mean ergodic theorem. The second is called convergence in probability, 
which is implied by the first and leads to a result called the weak law of 
large numbers. The second result will follow from the first via a simple but 
powerful inequality relating probabilities and expectations.

A sequence of random variables $Y_n$, $n = 1, 2, \ldots$, is said to converge in
mean square or converge in quadratic mean to a random variable $Y$ if

$$\lim_{n\to\infty} E[(Y_n - Y)^2] = 0.$$

This is also written $Y_n \to Y$ in mean square or $Y_n \to Y$ in quadratic mean.

If Yn converges to Y in mean square, we state this convergence mathe- 
matically by writing 

$$\mathop{\mathrm{l.i.m.}}_{n\to\infty} Y_n = Y,$$

where l.i.m. is an acronym for “limit in the mean.” Although it is likely not
obvious to the novice, it is important to understand that convergence in
mean square does not imply convergence with probability one. Examples
converging in one sense and not the other may be found in problem 32.

Thus a sequence of random variables converges in mean square to an- 
other random variable if the second moment of the difference converges to 
zero in the ordinary sense of convergence of a sequence of real numbers. Al- 
though the definition encompasses convergence to a random variable with 
any degree of “randomness,” in most applications that we shall encounter 
the limiting random variable is a degenerate random variable, i.e., a constant.
In particular, the sequence of sample averages, $\{S_n\}$, of the preceding
section is next seen to converge in this sense.

The final notion of convergence bears a strong resemblance to the notion
of convergence with probability one, but the resemblance is a faux ami: the
two notions are fundamentally different. A sequence of random variables
$Y_n$, $n = 1, 2, \ldots$, is said to converge in probability to a random variable $Y$
if for every $\epsilon > 0$,

$$\lim_{n\to\infty} \Pr(|Y_n - Y| > \epsilon) = 0.$$

Thus a sequence of random variables converges in probability if the
probability that the member of the sequence differs from the limit by more
than an arbitrarily small $\epsilon$ goes to zero as $n \to \infty$. Note that just as
with convergence in mean square, convergence in probability is silent on
the question of convergence of individual realizations $Y_n(\omega)$. You could,
in fact, have no realizations converge individually and yet have convergence
in probability. All convergence in probability states is that
$\Pr(\omega : |Y_n(\omega) - Y(\omega)| > \epsilon)$ tends to zero with $n$. Suppose at time $n$
a given subset of $\Omega$ satisfies the inequality, at time $n + 2$ still a different
subset satisfies the inequality, etc. As long as the subsets have diminishing
probability, convergence in probability can occur without convergence of
the individual sequences.
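
The moving-subset picture can be made concrete with a standard "sliding interval" construction. This construction is not given in the text; it is an illustrative assumption in which $\omega$ is uniform on $[0, 1)$ and $Y_n(\omega)$ is the indicator of an interval that sweeps across $[0, 1)$ with ever shorter length. A minimal sketch:

```python
def interval(n):
    """Return (start, length) of the n-th sliding interval, n >= 1.

    Write n = 2**k + j with 0 <= j < 2**k; the interval is [j/2**k, (j+1)/2**k).
    """
    k = n.bit_length() - 1
    j = n - 2**k
    return j / 2**k, 1 / 2**k


def y(n, omega):
    """Y_n(omega): 1 if omega lies in the n-th interval, 0 otherwise."""
    start, length = interval(n)
    return 1 if start <= omega < start + length else 0


def prob_exceeds(n):
    """Pr(|Y_n - 0| > eps) for 0 < eps < 1 when omega is uniform on [0, 1)."""
    return interval(n)[1]


omega = 0.3  # one fixed sample point, chosen arbitrarily
hits = [n for n in range(1, 2**10) if y(n, omega) == 1]
print("times n < 1024 at which Y_n(omega) = 1:", hits)
print("Pr(Y_n > eps) at n = 1, 2, 4, 8, 512:",
      [prob_exceeds(n) for n in (1, 2, 4, 8, 512)])
```

The probabilities shrink to zero, so $Y_n \to 0$ in probability, yet the fixed sample point is hit once in every sweep, so the realization $Y_n(\omega)$ keeps returning to 1 and does not converge.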

Also, as in convergence in the mean square sense, convergence in probability
is to a random variable in general, but this includes the most interesting
case of a degenerate random variable, i.e., a constant.

The two notions of convergence, convergence in mean square and
convergence in probability, can be related to each other via simple, but
important, inequalities. It will be seen that convergence in mean square
is the stronger of the two notions; that is, if a sequence converges in mean square,
then it also converges in probability, but not necessarily vice versa. The
two inequalities are slight variations on each other, but they are stated 
separately for clarity and both an elementary and a more elegant proof are 
presented. 

The Tchebychev Inequality 

Suppose that $X$ is a random variable with mean $m_X$ and variance $\sigma_X^2$.
Then

$$\Pr(|X - m_X| \ge \epsilon) \le \frac{\sigma_X^2}{\epsilon^2}. \tag{4.107}$$

We prove the result here for the discrete case. The continuous case is
similar (and can be inferred from the more general proof of the Markov
inequality to follow).

The result follows from a sequence of inequalities.

$$\begin{aligned}
\sigma_X^2 &= E[(X - m_X)^2] \\
&= \sum_x (x - m_X)^2 p_X(x) \\
&= \sum_{x : |x - m_X| < \epsilon} (x - m_X)^2 p_X(x) + \sum_{x : |x - m_X| \ge \epsilon} (x - m_X)^2 p_X(x) \\
&\ge \sum_{x : |x - m_X| \ge \epsilon} (x - m_X)^2 p_X(x) \\
&\ge \epsilon^2 \sum_{x : |x - m_X| \ge \epsilon} p_X(x) \\
&= \epsilon^2 \Pr(|X - m_X| \ge \epsilon).
\end{aligned}$$

Note that the Tchebychev inequality implies that

$$\Pr(|V - \bar{V}| \ge \gamma \sigma_V) \le \frac{1}{\gamma^2},$$

that is, the probability that $V$ is farther from its mean by more than $\gamma$
times its standard deviation (the square root of its variance) is no greater
than $\gamma^{-2}$.
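
As a sanity check on (4.107), the exact tail probability of a small discrete pmf can be compared with the bound. The sketch below is illustrative only: the fair-die pmf and the helper names are assumptions, not part of the text, and exact rational arithmetic is used so that no rounding clouds the comparison.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Mean and variance of a discrete pmf given as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    var = sum((x - m) ** 2 * p for x, p in pmf.items())
    return m, var


def tail_probability(pmf, eps):
    """Exact Pr(|X - m_X| >= eps)."""
    m, _ = mean_and_variance(pmf)
    return sum(p for x, p in pmf.items() if abs(x - m) >= eps)


def tchebychev_bound(pmf, eps):
    """Right-hand side of (4.107): variance divided by eps squared."""
    _, var = mean_and_variance(pmf)
    return var / Fraction(eps) ** 2


# Assumed example: a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}
for eps in (1, 2, Fraction(5, 2)):
    print(eps, tail_probability(die, eps), tchebychev_bound(die, eps))
```

For small $\epsilon$ the bound exceeds 1 and says nothing; it becomes informative only once $\epsilon$ is larger than the standard deviation.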



The Markov Inequality. Given a nonnegative random variable $U$
with finite expectation $EU$, for any $a > 0$ we have

$$\Pr(U \ge a) = P_U([a, \infty)) \le \frac{EU}{a}.$$

Proof: The result can be proved in the same manner as the Tchebychev
inequality by separate consideration of the discrete and continuous
cases. Here we give a more general proof. Fix $a > 0$ and set $F = \{u :
u \ge a\}$. Let $1_F(u)$ be the indicator function of $F$, 1 if $u \ge a$ and 0
otherwise. Then since $F \cap F^c = \emptyset$ and $F \cup F^c = \Re$, the real line, we have using the linearity
of expectation and the fact that $U \ge 0$ with probability one that

$$\begin{aligned}
E[U] &= E[U(1_F(U) + 1_{F^c}(U))] \\
&= E[U 1_F(U)] + E[U 1_{F^c}(U)] \\
&\ge E[U 1_F(U)] \ge a E[1_F(U)] \\
&= a P_U(F) = a \Pr(U \ge a),
\end{aligned}$$

completing the proof.



Observe that if a random variable $U$ is nonnegative and has small
expectation, say $EU \le \epsilon$, then the Markov inequality with $a = \sqrt{\epsilon}$ implies
that

$$\Pr(U \ge \sqrt{\epsilon}) \le \sqrt{\epsilon}.$$

This can be interpreted as saying that the random variable can take on
values greater than $\sqrt{\epsilon}$ no more than $\sqrt{\epsilon}$ of the time.

Before applying this result, we pause to present a second proof of the 
Markov inequality that has a side result of some interest in its own right. 
As before assume that $U \ge 0$. Assume for the moment that $U$ is continuous
so that

$$E[U] = \int_0^\infty x f_U(x)\,dx.$$

Consider the admittedly strange looking equality

$$x = \int_0^\infty 1_{[\alpha,\infty)}(x)\,d\alpha,$$

which follows since the integrand is 1 if and only if $\alpha \le x$ and hence
integrating 1 as $\alpha$ ranges from 0 to $x$ yields $x$. Plugging this equality into
the previous integral expression for expectation and changing the order of
integration yields



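$$\begin{aligned}
E[U] &= \int_0^\infty \left( \int_0^\infty 1_{[\alpha,\infty)}(x)\,d\alpha \right) f_U(x)\,dx \\
&= \int_0^\infty \left( \int_0^\infty 1_{[\alpha,\infty)}(x) f_U(x)\,dx \right) d\alpha,
\end{aligned}$$

which can be expressed as

$$E[U] = \int_0^\infty \Pr(U \ge \alpha)\,d\alpha = \int_0^\infty (1 - F_U(\alpha))\,d\alpha. \tag{4.108}$$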

This result immediately gives the Markov inequality since for any fixed
$a > 0$,

$$E[U] = \int_0^\infty \Pr(U \ge \alpha)\,d\alpha \ge a \Pr(U \ge a).$$

To see this, $\Pr(U \ge \alpha)$ is monotonically nonincreasing with $\alpha$, so for all
$\alpha \le a$ we must have $\Pr(U \ge \alpha) \ge \Pr(U \ge a)$ (and for other $\alpha$, $\Pr(U \ge \alpha) \ge
0$). Plugging the bound into the integral yields the claimed inequality.
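
As a quick worked check of (4.108), assume for illustration that $U$ is exponential with parameter $\lambda > 0$, so that $f_U(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ and $E[U] = 1/\lambda$. This choice of distribution is an assumption made only for the example; the text does not single it out.

```latex
1 - F_U(\alpha) = e^{-\lambda \alpha}, \quad \alpha \ge 0,
\qquad
\int_0^\infty \bigl(1 - F_U(\alpha)\bigr)\,d\alpha
  = \int_0^\infty e^{-\lambda \alpha}\,d\alpha
  = \frac{1}{\lambda} = E[U],
\qquad
\Pr(U \ge a) = e^{-\lambda a} \le \frac{1}{\lambda a} = \frac{E[U]}{a}.
```

The last inequality is the Markov inequality for this case; it holds because $x e^{-x} \le e^{-1} < 1$ for all $x = \lambda a > 0$.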

Lemma 4.3 If $Y_n$ converges to $Y$ in mean square, then it also converges
in probability.

Proof. From the Markov inequality applied to $|Y_n - Y|^2$, we have for
any $\epsilon > 0$

$$\Pr(|Y_n - Y| \ge \epsilon) = \Pr(|Y_n - Y|^2 \ge \epsilon^2) \le \frac{E(|Y_n - Y|^2)}{\epsilon^2}.$$

The right-hand term goes to zero as $n \to \infty$ by definition of convergence in
mean square.



Although convergence in mean square implies convergence in probabil- 
ity, the reverse statement cannot be made; i.e., they are not equivalent. 
This is shown by a simple counterexample. Let $Y_n$ be a discrete random
variable with pmf

$$p_{Y_n}(y) = \begin{cases} 1 - 1/n & \text{if } y = 0 \\ 1/n & \text{if } y = n. \end{cases}$$

Convergence in probability to zero without convergence in mean square is
easily verified. In particular, the sequence converges in probability since
for any $\epsilon > 0$ and all $n > \epsilon$, $\Pr[|Y_n - 0| > \epsilon] = \Pr[Y_n > 0] = 1/n$, which goes to 0 as $n \to \infty$. On the
other hand, $E[(Y_n - 0)^2]$ would have to go to 0 for $Y_n$ to converge to 0 in
mean square, but it is $E[Y_n^2] = 0 \cdot (1 - 1/n) + n^2/n = n$, which does not
converge to 0 as $n \to \infty$.
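
The two quantities in this counterexample can be tabulated directly from the pmf. The following is a minimal sketch using exact fractions; the helper names are illustrative and not from the text.

```python
from fractions import Fraction


def pmf(n):
    """pmf of Y_n: mass 1 - 1/n at 0 and mass 1/n at n (n >= 1)."""
    return {0: 1 - Fraction(1, n), n: Fraction(1, n)}


def prob_exceeds(n, eps):
    """Exact Pr(|Y_n - 0| > eps)."""
    return sum(p for y, p in pmf(n).items() if abs(y) > eps)


def second_moment(n):
    """Exact E[(Y_n - 0)**2]."""
    return sum(y ** 2 * p for y, p in pmf(n).items())


eps = Fraction(1, 2)  # any fixed positive tolerance would do
for n in (1, 10, 100, 1000):
    print(n, prob_exceeds(n, eps), second_moment(n))
```

The probability column decays like $1/n$ while the second-moment column grows like $n$: the rare excursion to the value $n$ is too large for the mean square error to vanish.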

4.15 Weak Law of Large Numbers 

We now have the definitions and preliminaries to prove laws of large num- 
bers showing that sample averages converge to the expectation of the in- 
dividual samples. The basic (and classical) results hold for uncorrelated 
random processes with constant variance.

A Mean Ergodic Theorem

Theorem 4.11 Let $\{X_n\}$ be a discrete time uncorrelated random process
such that $EX_n = \bar{X}$ is finite and $\sigma_{X_n}^2 = \sigma_X^2 < \infty$ for all $n$; that is, the
mean and variance are the same for all sample times. Then

$$\mathop{\mathrm{l.i.m.}}_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i = \bar{X},$$

that is, $\frac{1}{n} \sum_{i=0}^{n-1} X_i \to \bar{X}$ in mean square.

Proof. The proof follows directly from the last section with $S_n =
\frac{1}{n} \sum_{i=0}^{n-1} X_i$, $ES_n = EX_i = \bar{X}$. To summarize from (4.106),

$$\begin{aligned}
\lim_{n\to\infty} E[(S_n - \bar{X})^2] &= \lim_{n\to\infty} E[(S_n - ES_n)^2] \\
&= \lim_{n\to\infty} \sigma_{S_n}^2 \\
&= \lim_{n\to\infty} \frac{\sigma_X^2}{n} = 0.
\end{aligned}$$

This theorem is called a mean ergodic theorem because it is a special case 
of the more general mean ergodic theorem — it is a special case since it 
holds only for uncorrelated random processes. We shall later consider more 
general results along this line, but this simple result and the one to follow 
provide the basic ideas. 

Combining lemma 4.3 with the mean ergodic theorem 4.11 yields the 
following famous result, one of the original limit theorems of probability 
theory: 



Theorem 4.12 The Weak Law of Large Numbers.

Let $\{X_n\}$ be a discrete time process with finite mean $EX_n = \bar{X}$ and
variance $\sigma_{X_n}^2 = \sigma_X^2 < \infty$ for all $n$. If the process is uncorrelated, then the
sample average $n^{-1} \sum_{i=0}^{n-1} X_i$ converges to $\bar{X}$ in probability.

An alternative means of describing a law of large numbers is to define the
limiting time-average or sample average of a sequence of random variables
$\{X_n\}$ by

$$\langle X_n \rangle = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i \tag{4.109}$$

if the limit exists in any of the manners considered, e.g., in mean square,
in probability, or with probability 1. Note that ordinarily the limiting time
average must be considered as a random variable since it is a function of
random variables. Laws of large numbers then provide conditions under
which

$$\langle X_n \rangle = E(X_k), \tag{4.110}$$

which requires that $\langle X_n \rangle$ not be a random variable, i.e., that it be a
constant and not vary with the underlying sample point $\omega$, and that $E(X_k)$
not depend on time, i.e., that it be a constant and not vary with time $k$.

The best-known (and earliest) application of the weak law of large num- 
bers is to iid processes such as the Bernoulli process. Note that the iid 
specification is not needed, however. All that is used for the weak law of 
large numbers is constant means, constant variances, and uncorrelation. 
The actual distributions could be time varying and dependent within these 
constraints. The weak law is called weak because convergence in probabil- 
ity is one of the weaker forms of convergence. Convergence of individual 
realizations of the random process is not assured. This could be very an- 
noying because in many practical engineering applications, we have only 
one realization to work with (i.e., only one $\omega$), and we need to calculate
averages that converge as determined by actual calculations, e.g., with a 
computer. 
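
A small simulation can illustrate what the weak law does and does not promise. The sketch below is an illustration under stated assumptions, not part of the text: it uses a Bernoulli process with parameter `p`, a seeded pseudo-random generator, and hypothetical helper names. It estimates $\Pr(|S_n - p| > \epsilon)$ from repeated independent runs and prints it next to the Tchebychev bound $\sigma_X^2/(n\epsilon^2)$, clipped at 1.

```python
import random


def sample_average(n, p, rng):
    """Sample average of n Bernoulli(p) samples drawn from rng."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n


def deviation_frequency(n, p, eps, trials, rng):
    """Fraction of independent runs in which |S_n - p| > eps."""
    bad = sum(1 for _ in range(trials)
              if abs(sample_average(n, p, rng) - p) > eps)
    return bad / trials


def tchebychev_bound(n, p, eps):
    """Bound on Pr(|S_n - p| > eps) from (4.106) and (4.107), clipped at 1."""
    return min(1.0, p * (1 - p) / (n * eps ** 2))


rng = random.Random(0)  # fixed seed so that a rerun reproduces the same table
p, eps, trials = 0.5, 0.1, 200
for n in (10, 100, 1000):
    print(n, deviation_frequency(n, p, eps, trials, rng),
          tchebychev_bound(n, p, eps))
```

Each estimate averages over many realizations, which is exactly the sense in which convergence in probability is stated; the table by itself says nothing about whether the running average of any single realization settles down.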

The strong law of large numbers considers convergence with probability 
one. Such strong theorems are much harder to prove, but fortunately are 
satisfied in most engineering situations. 

The astute reader may have noticed the remarkable difference in behavior
caused by the apparently slight change of division by $\sqrt{n}$ instead
of $n$ when normalizing sums of iid random variables. In particular, if
$\{X_n\}$ is a zero mean process with unit variance, then the weighted sum
$n^{-1/2} \sum_{k=0}^{n-1} X_k$ converges to a Gaussian random variable in some sense
because of the central limit theorem, while the weighted sum $n^{-1} \sum_{k=0}^{n-1} X_k$
converges to a constant, the mean 0 of the individual random variables!

4.16 ⋆Strong Law of Large Numbers

The strong law of large numbers replaces the convergence in probability 
of the weak law with convergence with probability one. It will shortly 
be shown that convergence with probability one implies convergence in 
probability, so the “strong” law is indeed stronger than the “weak” law. 
Although the two terms sound the same, they are really quite different. 
Convergence with probability one applies to individual realizations of the
random process, while convergence in probability does not. Convergence
with probability one is closer to the usual definition of convergence of a
sequence of numbers since it says that the limiting
sample average $\lim_{n\to\infty} n^{-1} \sum_{k=0}^{n-1} X_k(\omega)$ exists in the usual sense for all
sample points $\omega$ in a set of probability one. Although a more satisfying notion of
convergence, it is notably harder to prove than the weaker result and hence we
consider only the special case of iid sequences, where the added difficulty 
is moderate. In this section convergence with probability one is consid- 
ered and a strong law of large numbers is proved. The key new tools are 
the Borel-Cantelli lemma, which provides a condition ensuring convergence 
with probability one, and the Chernoff inequality, an improvement on the 
Tchebychev inequality which is a simple result of the Markov inequality. 

Lemma 4.4 If $Y_n$ converges to $Y$ with probability one, then it also
converges in probability.

Proof: Given an $\epsilon > 0$, define the sequence of sets

$$F_n(\epsilon) = \{\omega : |Y_m(\omega) - Y(\omega)| > \epsilon \text{ for some } m \ge n\}.$$

The $F_n(\epsilon)$ form a decreasing sequence of sets as $n$ grows, that is, $F_n \subset F_{n-1}$
for all $n$. Thus $\Pr(F_n)$ is nonincreasing in $n$ and hence it must converge to
some limit. From the definition of convergence with probability one, this
limit must be 0 since if $Y_n(\omega)$ converges to $Y(\omega)$, given $\epsilon$ there must be an
$n$ such that for all $m \ge n$, $|Y_m(\omega) - Y(\omega)| < \epsilon$. Thus

$$\lim_{n\to\infty} \Pr(|Y_n - Y| > \epsilon) \le \lim_{n\to\infty} \Pr(F_n(\epsilon)) = 0,$$

which establishes convergence in probability.

Convergence in probability does not imply convergence with probability 
one; i.e., they are not equivalent. This can be shown by counterexample 
(problem 32). There is, however, a test that can be applied to determine 
convergence with probability one. The result is one form of a result known 
as the first Borel-Cantelli lemma.

Lemma 4.5 $Y_n$ converges to $Y$ with probability one if for any $\epsilon > 0$

$$\sum_{n=1}^{\infty} \Pr(|Y_n - Y| > \epsilon) < \infty. \tag{4.111}$$

Proof: Consider two collections of bad sequences. Let $F(\epsilon)$ be the set of
all $\omega$ such that the corresponding sequence $Y_n(\omega)$ does not satisfy
the convergence criterion, i.e.,

$$F(\epsilon) = \{\omega : |Y_n - Y| > \epsilon, \text{ for some } n \ge N, \text{ for any } N < \infty\}.$$

$F(\epsilon)$ is the set of points for which the sequence does not converge. Consider
also the simpler sets where things look bad at a particular time:

$$F_n(\epsilon) = \{\omega : |Y_n - Y| > \epsilon\}.$$

The complicated collection of points with nonconvergent sequences can be
written as a subset of the union of all of the simpler sets:

$$F(\epsilon) \subset \bigcup_{n \ge N} F_n(\epsilon) \equiv G_N(\epsilon)$$

for any finite $N$. This in turn implies that

$$\Pr(F(\epsilon)) \le \Pr\Bigl(\bigcup_{n \ge N} F_n(\epsilon)\Bigr).$$

From the union bound this implies that

$$\Pr(F(\epsilon)) \le \sum_{n=N}^{\infty} \Pr(F_n(\epsilon)).$$

By assumption

$$\sum_{n=1}^{\infty} \Pr(F_n(\epsilon)) < \infty,$$

which implies that

$$\lim_{N\to\infty} \sum_{n=N}^{\infty} \Pr(F_n(\epsilon)) = 0$$

and hence $\Pr(F(\epsilon)) = 0$, proving the result.

Convergence with probability one does not imply — nor is it implied 
by — convergence in mean square. This can be shown by counterexamples 
(problem 32). 

We now apply this result to sample averages to obtain a strong law of 
large numbers for an iid random process {Xn}. For simplicity we focus on 
a zero mean Gaussian iid process and prove that with probability one 

$$\lim_{n\to\infty} S_n = 0,$$

where

$$S_n = \frac{1}{n} \sum_{k=0}^{n-1} X_k.$$

Assuming zero mean does not lose any generality since if this result is true,
the result for nonzero mean $m$ follows immediately by applying the zero
mean result to the zero-mean process $\{X_n - m\}$.

The approach is to use the Borel-Cantelli lemma with $Y_n = S_n$ and
$Y = 0 = E[X_n]$ and hence the immediate problem is to bound $\Pr(|S_n| > \epsilon)$
in a way so that the sum over $n$ will be finite. The Tchebychev
inequality does not work here as it would give a sum

$$\sigma_X^2 \sum_{n=1}^{\infty} \frac{1}{n \epsilon^2},$$

which is not finite. A better upper bound than Tchebychev is needed, and
this is provided by a different application of the Markov inequality. Given
a random variable $Y$, fix a $\lambda > 0$ and observe that $Y > y$ if and only if
$e^{\lambda Y} > e^{\lambda y}$. Application of the Markov inequality then yields

$$\begin{aligned}
\Pr(Y > y) &= \Pr(e^{\lambda Y} > e^{\lambda y}) \\
&= \Pr(e^{\lambda (Y - y)} > 1) \\
&\le E[e^{\lambda (Y - y)}].
\end{aligned} \tag{4.112}$$

This inequality is called the Chernoff inequality and it provides the needed
bound.

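Applying the Chernoff inequality yields for any $\lambda > 0$

$$\begin{aligned}
\Pr(|S_n| > \epsilon) &= \Pr(S_n > \epsilon) + \Pr(S_n < -\epsilon) \\
&= \Pr(S_n > \epsilon) + \Pr(-S_n > \epsilon) \\
&\le E[e^{\lambda (S_n - \epsilon)}] + E[e^{\lambda (-S_n - \epsilon)}] \\
&= e^{-\lambda \epsilon} E[e^{\lambda S_n}] + e^{-\lambda \epsilon} E[e^{-\lambda S_n}] \\
&= e^{-\lambda \epsilon} \left( M_{S_n}(\lambda) + M_{S_n}(-\lambda) \right).
\end{aligned}$$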

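These moment generating functions are easily found from lemma 4.1 to be

$$E[e^{\gamma S_n}] = M_X^n\left(\frac{\gamma}{n}\right), \tag{4.113}$$

where $M_X(ju) = E[e^{juX}]$ is the common characteristic function of the iid
$X_i$ and $M_X(w)$ is the corresponding moment generating function. Combining
these steps yields the bound

$$\Pr(|S_n| > \epsilon) \le e^{-\lambda \epsilon} \left( M_X^n\left(\frac{\lambda}{n}\right) + M_X^n\left(-\frac{\lambda}{n}\right) \right). \tag{4.114}$$

So far $\lambda > 0$ is completely arbitrary and we can choose a different $\lambda$ for
each $n$. Choosing $\lambda = n\epsilon/\sigma_X^2$ yields

$$\Pr(|S_n| > \epsilon) \le e^{-n\epsilon^2/\sigma_X^2} \left( M_X^n\left(\frac{\epsilon}{\sigma_X^2}\right) + M_X^n\left(-\frac{\epsilon}{\sigma_X^2}\right) \right). \tag{4.115}$$

Plugging in the form for the Gaussian moment generating function $M_X(w) =
e^{w^2 \sigma_X^2/2}$ yields

$$\Pr(|S_n| > \epsilon) \le 2 e^{-n\epsilon^2/\sigma_X^2} e^{n\epsilon^2/(2\sigma_X^2)} = 2 \left( e^{-\epsilon^2/(2\sigma_X^2)} \right)^n, \tag{4.116}$$

which has the form $\Pr(|S_n| > \epsilon) \le 2\beta^n$ for $\beta < 1$. Hence summing a
geometric progression yields

$$\sum_{n=1}^{\infty} \Pr(|S_n| > \epsilon) \le 2 \sum_{n=1}^{\infty} \beta^n = 2 \frac{\beta}{1 - \beta} < \infty, \tag{4.117}$$

which completes the proof for the iid Gaussian case.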
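
The difference between the two bounds is easy to see numerically. The sketch below is illustrative and rests on stated assumptions: the process is zero mean iid Gaussian, so that $S_n$ is Gaussian with variance $\sigma_X^2/n$ and the exact tail can be written with the complementary error function; the values of `eps` and `sigma` and the function names are arbitrary choices, not from the text.

```python
import math


def exact_tail(n, eps, sigma):
    """Pr(|S_n| > eps) when S_n is Gaussian with mean 0 and variance sigma**2 / n."""
    return math.erfc(eps * math.sqrt(n) / (sigma * math.sqrt(2)))


def chernoff_bound(n, eps, sigma):
    """Geometric bound 2 * beta**n with beta = exp(-eps**2 / (2 * sigma**2))."""
    beta = math.exp(-eps ** 2 / (2 * sigma ** 2))
    return 2 * beta ** n


def tchebychev_bound(n, eps, sigma):
    """Tchebychev bound sigma**2 / (n * eps**2) on the same probability."""
    return sigma ** 2 / (n * eps ** 2)


eps, sigma = 0.5, 1.0  # assumed values for the demonstration
beta = math.exp(-eps ** 2 / (2 * sigma ** 2))
geometric_limit = 2 * beta / (1 - beta)  # limit of the sum of the geometric bounds
for big_n in (10, 100, 1000):
    chernoff_sum = sum(chernoff_bound(n, eps, sigma) for n in range(1, big_n + 1))
    tcheb_sum = sum(tchebychev_bound(n, eps, sigma) for n in range(1, big_n + 1))
    print(big_n, chernoff_sum, geometric_limit, tcheb_sum)
```

The partial sums of the Chernoff bounds level off at the geometric limit, which is what the Borel-Cantelli lemma needs, while the partial sums of the Tchebychev bounds keep growing like a harmonic series.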



The nonGaussian case can be handled by combining the above approach
with the approximation of (4.16). The bound for the Borel-Cantelli limit
need only be demonstrated for small $\epsilon$ since if it is true for small $\epsilon$ it must
also be true for large $\epsilon$. For small $\epsilon$, however, (4.16) implies that $M_X(\pm\epsilon/\sigma_X^2)$
in (4.115) can be written as $1 + \epsilon^2/(2\sigma_X^2) + o(\epsilon^2/(2\sigma_X^2))$, which is arbitrarily
close to $e^{\epsilon^2/(2\sigma_X^2)}$ for sufficiently small $\epsilon$, and the proof is completed as above.

The following theorem summarizes the results of this section. 

Theorem 4.13 Strong Law of Large Numbers

Given an iid process $\{X_n\}$ with finite mean $E[X]$ and variance, then

$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=0}^{n-1} X_k = E[X] \text{ with probability 1.} \tag{4.118}$$

4.17 Stationarity

Stationarity Properties 

In the development of the weak law of large numbers we made two
assumptions on a random process $\{X_t; t \in \mathcal{Z}\}$: that the mean $EX_t$ of the
process did not depend on time and that the covariance function had the
form $K_X(t,s) = \sigma_X^2 \delta_{t-s}$.

The assumption of a constant mean, independent of time, is an example 
of a stationarity property in the sense that it assumes that some property 
describing a random process does not vary with time (or is time-invariant). 
The process itself is not usually “stationary” in the usual literal sense of 
remaining still, but attributes of the process, such as the first moment in 
this case, can remain still in the sense of not changing with time. In the 
mean example we can also express this as 

$$EX_t = EX_{t+\tau}, \text{ all } t, \tau, \tag{4.119}$$

which we can interpret as saying that the mean of a random variable at time $t$ is
not affected by a shift of any amount of time $\tau$. Conditions on moments can
be thought of as weak stationarity properties since they constrain only an 
expectation and not the distribution itself. Instead of simply constraining 
a moment, we could make the stronger assumption of constraining the 
marginal distribution. The assumption of a constant mean would follow, 
for example, if the marginal distribution of the process, the distribution 
of a single random variable Xt, did not depend on the sample time t. 
Thus a sufficient (but not necessary) condition for ensuring that a random 
process has a constant mean is that its marginal distribution $P_{X_t}$ satisfies
the condition

$$P_{X_t} = P_{X_{t+\tau}}, \text{ all } t, \tau. \tag{4.120}$$

This will be true, for example, if the same relation holds with the distribu- 
tion replaced by cdf’s, pdf’s, or pmf’s. If a process meets this condition, it 
is said to be first order stationary. For example, an iid process is clearly 
first order stationary. The word stationary refers to the fact that the first 
order distribution (in this case) does not change with time, i.e., it is not 
affected by shifting the sample time by an amount $\tau$.

Next consider the covariance used to prove the weak law of large num- 
bers. It has a very special form in that it is the variance if the two sample 
times are the same, and zero otherwise. This class of constant mean, con- 
stant variance, and uncorrelated processes is admittedly a very special case. 
A more general class of processes which will share many important properties
with this very special case is formed by requiring a mean and variance
that do not change with time, but easing the restriction on the covariance.
We say that a random process is weakly stationary or stationary in the weak
sense if $EX_t$ does not depend on $t$, $\sigma_{X_t}^2$ does not depend on $t$, and if the
covariance $K_X(t,s)$ depends on $t$ and $s$ only through the difference $t - s$,
that is, if

$$K_X(t,s) = K_X(t+\tau, s+\tau) \tag{4.121}$$

for all $t, s, \tau$ for which $s, s+\tau, t, t+\tau \in \mathcal{T}$. When this is true, it is often
expressed by writing

$$K_X(t, t+\tau) = K_X(\tau) \tag{4.122}$$

for all $t, \tau$ such that $t, t+\tau \in \mathcal{T}$. A function of two variables of this type
is said to be Toeplitz [26, 21] and much of the theory of weakly stationary
processes follows from the theory of Toeplitz forms.

If we form a covariance matrix by sampling such a covariance function,
then the matrix (called a Toeplitz matrix) will have the property that all
elements on any fixed diagonal of the matrix will be equal. For example, the
(3,5) element will be the same as the (7,9) element since 5-3=9-7. Thus, for
example, if the sample times are $0, 1, \ldots, n-1$, then the covariance matrix
is $\{K_X(k,j) = K_X(j-k);\ k = 0, 1, \ldots, n-1,\ j = 0, 1, \ldots, n-1\}$ or

$$\begin{bmatrix}
K_X(0) & K_X(1) & K_X(2) & \cdots & K_X(n-1) \\
K_X(-1) & K_X(0) & K_X(1) & & \\
K_X(-2) & K_X(-1) & K_X(0) & & \vdots \\
\vdots & & & \ddots & \\
K_X(-(n-1)) & & \cdots & & K_X(0)
\end{bmatrix}.$$
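
Building such a matrix from a covariance function of a single argument takes one line. In the sketch below the function names and the particular covariance $K_X(\tau) = \sigma^2 \rho^{|\tau|}$ are illustrative assumptions, not something specified in the text.

```python
def toeplitz_covariance(k_x, n):
    """Return the n x n matrix whose (k, j) entry is k_x(j - k), for k, j = 0..n-1."""
    return [[k_x(j - k) for j in range(n)] for k in range(n)]


def example_covariance(tau, var=1.0, rho=0.5):
    """Hypothetical covariance function K_X(tau) = var * rho**|tau|."""
    return var * rho ** abs(tau)


matrix = toeplitz_covariance(example_covariance, 10)
# The (3,5) and (7,9) entries lie on the same diagonal since 5 - 3 = 9 - 7.
print(matrix[3][5] == matrix[7][9])
for row in toeplitz_covariance(example_covariance, 4):
    print(row)
```

Only the $2n - 1$ values $K_X(-(n-1)), \ldots, K_X(n-1)$ are needed to fill all $n^2$ entries, which is the practical payoff of the Toeplitz structure.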



As in the case of the constant mean, the adjective weakly refers to the 
fact that the constraint is placed on the moments and not on the distri- 
butions. Mimicking the earlier discussion, we could make a stronger as- 
sumption that is sufficient to ensure weak stationarity. A process is said to 
be second order stationary if the pairwise distributions are not affected by 
shifting, that is, if analogous to the moment condition (4.121) we make the 
stronger assumption that 



$$P_{X_t, X_s} = P_{X_{t+\tau}, X_{s+\tau}}; \text{ all } t, s, \tau. \tag{4.123}$$

Observe that second order stationarity implies first order since the marginals 
can be computed from the joints. The class of iid processes is second order 
stationary since the joint probabilities are products of the marginals, which 
do not depend on time.

There are a variety of such stationarity properties that can be defined,
but weak stationarity is one of the two most important for two reasons.
The first reason will be seen shortly — combining weak stationarity with 
an asymptotic version of uncorrelated gives a more general law of large 
numbers than the ones derived previously. The second reason will be seen 
in the next chapter: if a covariance depends only on a single argument 
(the difference of the sample times), then it will have an ordinary Fourier 
transform. Transforms of correlation and covariance functions provide a 
useful analysis tool for stochastic systems. 

It is useful before proceeding to consider the other most important sta- 
tionarity property: strict stationarity (sometimes the adjective “strict” is 
omitted). As the notion of weak stationarity can be considered as a
generalization of uncorrelated, the notion of strict stationarity can be considered
as a generalization of iid: if a process is iid, the probability distribution
of a $k$-dimensional random vector $X_n, X_{n+1}, \ldots, X_{n+k-1}$ does not depend
on the starting time of the collection of samples, i.e., for an iid process we
have that

$$p_{X_n, X_{n+1}, \ldots, X_{n+k-1}}(\mathbf{x}) = p_{X_{n+m}, X_{n+m+1}, \ldots, X_{n+m+k-1}}(\mathbf{x}), \text{ all } n, k, m. \tag{4.124}$$

This property can be interpreted as saying that the probability of any 
event involving a finite collection of samples of the random process does not 
depend on the starting time n of the samples and hence on the definition of 
time 0. Alternatively, these joint distributions are not affected by shifting 
the samples by a common amount m. In the simple Bernoulli process case 
this means things like 

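$$\begin{aligned}
p_{X_n}(0) &= p_{X_0}(0) = 1 - p, \text{ all } n \\
p_{X_n, X_k}(0,1) &= p_{X_0, X_{k-n}}(0,1) = p(1-p), \text{ all } n, k \\
p_{X_n, X_k, X_l}(0,1,0) &= p_{X_0, X_{k-n}, X_{l-n}}(0,1,0) = (1-p)^2 p, \text{ all } n, k, l,
\end{aligned}$$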

and so on. Note that the relative sample times stay the same, that is, the 
differences between the sample times are preserved, but all of the samples 
together are shifted without changing the probabilities. A process need not 
be iid to possess this property of joint probabilities being unaffected by 
shifts, so we formalize this idea with a definition. 

A discrete time random process $\{X_n\}$ is said to be stationary or strictly
stationary or stationary in the strict sense if (4.124) holds. We have ar- 
gued that a discrete alphabet iid process is an example of a stationary 
random process. This definition extends immediately to continuous alpha- 
bet discrete time processes by replacing the pmf’s by pdf’s. Both cases 
can be combined by using cdf’s or the distributions. Hence we can make a
more general definition for discrete time processes: A discrete time random
process $\{X_n\}$ is said to be stationary if

$$P_{X_n, X_{n+1}, \ldots, X_{n+k-1}} = P_{X_{n+m}, X_{n+m+1}, \ldots, X_{n+m+k-1}}, \text{ all } k, n, m. \tag{4.125}$$

This will hold if the corresponding formula holds for pmf’s, pdf’s, or cdf’s. 
For example, any iid random process is stationary. 

Generalizing the definition to include continuous time random processes 
requires only a little more work, much like that used to describe the Kol- 
mogorov extension theorem. We would like all joint distributions involving 
a finite collection of samples to not depend on the starting time or,
equivalently, to not be affected by shifts. The following general definition does
this and it reduces to the previous definition when the process is a discrete
time process.

A random process $\{X_t; t \in \mathcal{T}\}$ is stationary if

$$P_{X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}} = P_{X_{t_0 - \tau}, X_{t_1 - \tau}, \ldots, X_{t_{k-1} - \tau}}, \text{ all } k, t_0, t_1, \ldots, t_{k-1}, \tau. \tag{4.126}$$

The word “all” above must be interpreted with care: it means all choices of
dimension $k$, sample times $t_0, \ldots, t_{k-1}$, and shift $\tau$ for which the equation
makes sense, e.g., $k$ must be a positive integer and $t_i \in \mathcal{T}$ and $t_i - \tau \in \mathcal{T}$
for $i = 0, \ldots, k-1$.

It should be obvious that strict stationarity implies weak stationarity
since it implies that $P_{X_t}$ does not depend on $t$, and hence the mean computed
from this distribution does not depend on $t$, and $P_{X_t, X_s} = P_{X_{t-s}, X_0}$
and hence $K_X(t,s) = K_X(t-s, 0)$. The converse is generally not true
— knowing that two moments are unaffected by shifts does not in gen- 
eral imply that all finite dimensional distributions will be unaffected by 
shifts. This is why weak stationarity is indeed a “weaker” definition of 
stationarity. There is, however, one extremely important case where weak 
stationarity is sufficient to ensure strict stationarity - the case of Gaussian 
random processes. We shall not construct a careful proof of this fact be- 
cause it is a notational mess that obscures the basic idea, which is actually 
rather easy to describe. A Gaussian process $\{X_t; t \in \mathcal{T}\}$ is completely
characterized by knowledge of its mean function $\{m_t; t \in \mathcal{T}\}$ and its
covariance function $\{K_X(t,s); t, s \in \mathcal{T}\}$. All joint pdf’s for all possible finite
collections of sample times are expressed in terms of these two functions.
If the process is known to be weakly stationary, then $m_t = m$ for all $t$, and
$K_X(t,s) = K_X(t-s)$ for all $t, s$. This implies that all of the joint pdf’s will
be unaffected by a time shift, since the mean vector stays the same and the 
covariance matrix depends only on the relative differences of the sample
times, not on where they begin. Thus in this special case, knowing a process
is weakly stationary is sufficient to conclude it is stationary. In general,
stationarity can be quite difficult to prove, even for simple processes. 

*Strict Stationarity

In fact the above is not the definition of stationarity used in the mathe- 
matical and statistical literature, but it is equivalent to it. We pause for a 
moment to describe the more fundamental (but abstract) definition and its 
relation to the above definition, but the reader should keep in mind that 
it is the above definition that is the important one for practice: it is the 
definition that is almost always used to verify that a process is stationary 
or not. 

To state the alternative definition, recall that a random process $\{X_t; t \in \mathcal{T}\}$
can be considered to be a mapping from a probability space $(\Omega, \mathcal{F}, P)$
into a space of sequences or waveforms $\{x_t; t \in \mathcal{T}\}$ and that the inverse
image formula implies a probability measure called a process distribution,
say $P_X$, on this complicated space, i.e.,
$P_X(F) = P_X(\{\{x_t; t \in \mathcal{T}\} : \{x_t; t \in \mathcal{T}\} \in F\}) = P(\{\omega : \{X_t(\omega); t \in \mathcal{T}\} \in F\})$.
The abstract definition of stationarity places a condition on the process distribution: a
random process $\{X_t; t \in \mathcal{T}\}$ is stationary if the process distribution $P_X$ is
unchanged by shifting, that is, if

$$P_X(\{\{x_t; t \in \mathcal{T}\} : \{x_t; t \in \mathcal{T}\} \in F\}) = P_X(\{\{x_t; t \in \mathcal{T}\} : \{x_{t+\tau}; t \in \mathcal{T}\} \in F\}); \quad \text{all } F, \tau. \tag{4.127}$$

The only difference between the left and right hand side is that the right
hand side takes every sample waveform and shifts it by a common amount
$\tau$. If the abstract definition is applied to finite-dimensional events, that
is, events which actually depend only on a finite number of sample times,
then this definition reduces to that of (4.126). Conversely, it turns out that
having this property hold only on all finite-dimensional events is enough to
imply that the property holds for all possible events, even those depending
on an infinite number of samples (such as the event one gets an infinite
binary sequence with exactly $p$ limiting relative frequency of heads). Thus
the two definitions of strict stationarity are equivalent.

Why is stationarity important? Are processes that are not stationary
interesting? The answer to the first question is that this property leads to the
most famous of the laws of large numbers, which will be quoted without proof
later. The answer to the second question is yes, nonstationary processes
play an important role in theory and practice, as will be seen by example.
In particular, some nonstationary processes will have a form of law of
large numbers, and others will have no such property, yet be quite useful in
modeling real phenomena. Keep in mind that strict stationarity is stronger
than weak stationarity. Thus if a process is not even weakly stationary then
the process is also not strictly stationary. Two examples of nonstationary
processes already encountered are the Binomial counting process and the
discrete time Wiener process. These processes have marginal distributions
which change with time and hence the processes cannot be stationary. We
shall see in chapter 5 that these processes are also not weakly stationary.



4.18 Asymptotically Uncorrelated Processes 

We close this chapter with a generalization of the mean ergodic theorem 
and the weak law of large numbers that demonstrates that weak stationarity 
plus an asymptotic form of uncorrelation is sufficient to yield a weak law of 
large numbers by a fairly modest variation of the earlier proof. The class 
of asymptotically uncorrelated processes is often encountered in practice. 
Only the result itself is important; the proof is a straightforward but tedious
extension of the proof for the uncorrelated case.

An advantage of this more general result over the result for uncorre- 
lated discrete time random processes is that it extends in a sensible way to 
continuous time processes. 

A discrete time weakly stationary process $\{X_n; n \in \mathcal{Z}\}$ is said to be
asymptotically uncorrelated if its covariance function is absolutely summable,
that is, if

$$\sum_{k=-\infty}^{\infty} |K_X(k)| < \infty. \tag{4.128}$$

This condition implies that also

$$\lim_{k \to \infty} K_X(k) = 0, \tag{4.129}$$

and hence this property can be considered as a weak form of uncorrelation,
a generalization of the fact that a weakly stationary process is uncorrelated
if $K_X(k) = 0$ when $k \neq 0$. If a process is uncorrelated, then
$X_n$ and $X_{n+k}$ are uncorrelated random variables for all nonzero $k$; if it
is asymptotically uncorrelated, the correlation between the two random
variables decreases to zero as $k$ grows. We use (4.128) rather than (4.129)
as the definition as it also ensures the existence of a Fourier transform of
$K_X$, which will be useful later, and simplifies the proof of the resulting law
of large numbers.

Theorem 4.14 (A mean ergodic theorem): Let $\{X_n\}$ be a weakly
stationary asymptotically uncorrelated discrete time random process such that
$EX_n = \bar{X}$ is finite and $\sigma^2_{X_n} = \sigma^2_X < \infty$ for all $n$. Then

$$\mathop{\mathrm{l.i.m.}}_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i = \bar{X},$$

that is, $\frac{1}{n} \sum_{i=0}^{n-1} X_i \to \bar{X}$ in mean square.

Note that the theorem is indeed a generalization of the previous mean 
ergodic theorem since a weakly stationary uncorrelated process is trivially 
an asymptotically uncorrelated process. Note also that the Tchebychev 
inequality and this theorem immediately imply convergence in probability 
and hence a weak law of large numbers for weakly stationary asymptotically 
uncorrelated processes. A common example of asymptotically uncorrelated
processes are processes with exponentially decreasing covariance, i.e., of the
form $K_X(k) = \sigma_X^2 \rho^{|k|}$ for $\rho < 1$.
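
To make the conclusion of the theorem concrete, here is a small numerical sketch under stated assumptions: the covariance is taken to be $K_X(k) = \sigma^2 \rho^{|k|}$ with the illustrative values $\sigma^2 = 1$ and $\rho = 0.9$, and the function names are hypothetical. It computes the variance of the sample average $S_n = n^{-1}\sum_{i=0}^{n-1} X_i$ in two ways, as the double sum $n^{-2}\sum_i \sum_j K_X(i-j)$ and as the equivalent single sum $n^{-1}\sum_{k=-n+1}^{n-1} (1 - |k|/n) K_X(k)$, and prints how that variance shrinks as $n$ grows.

```python
def cov(k, sigma2=1.0, rho=0.9):
    """Assumed covariance K_X(k) = sigma2 * rho**|k| (illustrative values)."""
    return sigma2 * rho ** abs(k)


def var_double_sum(n):
    """Variance of S_n as n**-2 times the double sum of K_X(i - j)."""
    return sum(cov(i - j) for i in range(n) for j in range(n)) / n ** 2


def var_single_sum(n):
    """Same variance as (1/n) * sum over k of (1 - |k|/n) * K_X(k)."""
    return sum((1 - abs(k) / n) * cov(k) for k in range(-n + 1, n)) / n


for n in (10, 100, 1000):
    # The variance of the sample average decays roughly like 1/n.
    print(n, var_single_sum(n))
```

For this covariance $\sum_k K_X(k) = \sigma^2 (1+\rho)/(1-\rho) = 19$, so $n$ times the variance of $S_n$ approaches 19 while the variance itself goes to zero.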

*Proof:

Exactly as in the proof of Theorem 4.11 we have with $S_n = n^{-1} \sum_{i=0}^{n-1} X_i$
that

$$E[(S_n - \bar{X})^2] = E[(S_n - ES_n)^2] = \sigma^2_{S_n}.$$

From (4.104) we have that

$$\sigma^2_{S_n} = n^{-2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} K_X(i-j). \tag{4.130}$$

This sum can be rearranged as in Lemma B.1 of appendix B as

$$\sigma^2_{S_n} = \frac{1}{n} \sum_{k=-n+1}^{n-1} \left(1 - \frac{|k|}{n}\right) K_X(k). \tag{4.131}$$

From Lemma B.2

$$\lim_{n \to \infty} \sum_{k=-n+1}^{n-1} \left(1 - \frac{|k|}{n}\right) K_X(k) = \sum_{k=-\infty}^{\infty} K_X(k),$$

which is finite by assumption, hence dividing by $n$ yields

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=-n+1}^{n-1} \left(1 - \frac{|k|}{n}\right) K_X(k) = 0.$$


In a similar manner, a continuous time weakly stationary process $\{X(t); t \in \Re\}$
is said to be asymptotically uncorrelated if its covariance function is
absolutely integrable,

$$\int_{-\infty}^{\infty} |K_X(\tau)| \, d\tau < \infty, \tag{4.132}$$

which implies that

$$\lim_{\tau \to \infty} K_X(\tau) = 0. \tag{4.133}$$

No sensible continuous time random process can be uncorrelated (why
not?), but many are asymptotically uncorrelated. For a continuous time
process a sample or time average can be defined by replacing the sum
operation by an integral, that is, by

$$S_T = \frac{1}{T} \int_0^T X(t) \, dt. \tag{4.134}$$

(We will ignore the technical difficulties that must be considered to assure 
that the integral exists in a suitable fashion. Suffice it to say that an integral 
can be considered as a limit of sums, and we have seen ways to make such 
limits of random variables precise.) The definition of weakly stationary 
extends immediately to continuous time processes. The following result 
can be proved by extending the discrete time result to continuous time and 
integrals. 



Theorem 4.15 (A mean ergodic theorem): Let $\{X(t)\}$ be a weakly
stationary asymptotically uncorrelated continuous time random process such
that $EX(t) = \bar{X}$ is finite and $\sigma^2_{X(t)} = \sigma^2_X < \infty$ for all $t$. Then

$$\mathop{\mathrm{l.i.m.}}_{T \to \infty} \frac{1}{T} \int_0^T X(t) \, dt = \bar{X},$$

that is, $\frac{1}{T} \int_0^T X(t) \, dt \to \bar{X}$ in mean square.

As in the discrete time case, convergence in mean square immediately
implies convergence in probability, but much additional work is required to
prove convergence with probability one. Also as in the discrete case, we
can define a limiting time average

$$\langle X(t) \rangle = \lim_{T \to \infty} \frac{1}{T} \int_0^T X(t) \, dt \tag{4.135}$$

and interpret the law of large numbers as stating that the time average
$\langle X(t) \rangle$ exists in some sense and equals the expectation.
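
The continuous time statement can be illustrated with a short numerical sketch. The assumptions are not from the text: the covariance is taken to be $K_X(\tau) = \sigma^2 e^{-a|\tau|}$ with illustrative values $\sigma^2 = 1$ and $a = 2$, and the variance of the time average $S_T$ is taken to be $\frac{2}{T}\int_0^T (1 - \tau/T) K_X(\tau)\, d\tau$, the standard continuous time analogue of the discrete rearrangement, used here without proof. The code evaluates that integral with a midpoint rule and compares it with the closed form $2\sigma^2/(aT) - 2\sigma^2(1 - e^{-aT})/(aT)^2$ obtained by integrating by hand.

```python
import math


def cov(tau, sigma2=1.0, a=2.0):
    """Assumed covariance K_X(tau) = sigma2 * exp(-a * |tau|) (illustrative)."""
    return sigma2 * math.exp(-a * abs(tau))


def var_time_average(T, steps=20000):
    """Midpoint-rule value of (2/T) * integral_0^T (1 - tau/T) K_X(tau) dtau."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h          # midpoint of the i-th subinterval
        total += (1.0 - tau / T) * cov(tau)
    return 2.0 * total * h / T


def var_closed_form(T, sigma2=1.0, a=2.0):
    """Hand-integrated value of the same expression for this covariance."""
    return 2.0 * sigma2 / (a * T) - 2.0 * sigma2 * (1.0 - math.exp(-a * T)) / (a * T) ** 2


for T in (1.0, 10.0, 100.0):
    # Both columns should agree, and both shrink as the averaging window grows.
    print(T, var_time_average(T), var_closed_form(T))
```

The variance decays roughly like $1/T$, which is the mean square convergence asserted by the theorem for this particular covariance.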

4.19 Problems 

1. The Cauchy pdf is defined by

$$f_X(x) = \frac{1}{\pi} \frac{1}{1 + x^2}; \quad x \in \Re.$$

Find $EX$. Hint: This is a trick question. Check the definition of
Riemann integration over $(-\infty, \infty)$ before deciding on a final answer.

2. Suppose that $Z$ is a discrete random variable with probability mass
function

$$p_Z(k) = C \frac{a^k}{(1+a)^{k+1}}, \quad k = 0, 1, \cdots.$$

(This is sometimes called “Pascal’s distribution.”) Find the constant
$C$ and the mean, characteristic function, and variance of $Z$.

3. State and prove the fundamental theorem of expectation for the case
where a discrete random variable $X$ is defined on a probability space
where the probability measure is described by a pdf $f$.

4. Suppose that $X$ is a random variable with pdf $f_X(\alpha)$ and
characteristic function $M_X(ju) = E[e^{juX}]$. Define the new random variable
$Y = aX + b$, where both $a$ and $b$ are positive constants. Find the
pdf $f_Y$ and characteristic function $M_Y(ju)$ in terms of $f_X$ and $M_X$,
respectively.

5. $X$, $Y$ and $Z$ are iid Gaussian random variables with $\mathcal{N}(1, 1)$
distributions.

Define the random variables:

$$V = 2X + Y$$
$$W = 3X - 2Z + 5.$$

(a) Find $E[VW]$.

(b) Find the 2 parameters that completely specify the random
variable $V + W$.

(c) Find the characteristic function of the random vector $[V\ W]^t$,
where $t$ denotes “transpose.”

(d) Find the linear estimator $\hat{V}(W)$ of $V$, given $W$.

(e) Is this an optimal estimator? Why?

(f) The zero-mean random variables $X - \bar{X}$, $Y - \bar{Y}$ and $Z - \bar{Z}$
are the inputs to a black box. There are 2 outputs, $A$ and $B$.
It is determined that the covariance matrix of the vector of its
outputs $[A\ B]^t$ should be

$$\Lambda_{AB} = \begin{bmatrix} 3 & 2 \\ 2 & 5 \end{bmatrix}.$$

Find expressions for $A$ and $B$ in terms of the black box inputs so
that this is in fact the case (design the black box). Your answer
does not necessarily have to be unique.

(g) You are told that a different black box results in an output vector
$[C\ D]^t$ with the following covariance matrix:

$$\Lambda_{CD} = \begin{bmatrix} 2 & 0 \\ 0 & 7 \end{bmatrix}.$$

How much information about output $C$ does output $D$ give you?
Briefly but fully justify your answer.



6. Assume that $\{X_n\}$ is an iid process with Poisson marginal pmf

$$p_X(l) = \frac{\lambda^l e^{-\lambda}}{l!}; \quad l = 0, 1, 2, \ldots$$

and define the process $\{N_k; k = 0, 1, 2, \ldots\}$ by

$$N_k = \begin{cases} 0 & k = 0 \\ \sum_{l=1}^{k} X_l & k = 1, 2, \ldots \end{cases}$$

Define the process $\{Y_k\}$ by $Y_k = (-1)^{N_k}$ for $k = 0, 1, 2, \ldots$.

(a) Find the mean $E[N_k]$, characteristic function $M_{N_k}(ju) = E[e^{juN_k}]$,
and pmf $p_{N_k}(m)$.

(b) Find the mean $E[Y_k]$ and variance $\sigma^2_{Y_k}$.

(c) Find the conditional pmfs $p_{N_k | N_1, N_2, \ldots, N_{k-1}}(n_k | n_1, n_2, \ldots, n_{k-1})$
and $p_{N_k | N_{k-1}}(n_k | n_{k-1})$. Is $\{N_k\}$ a Markov process?



7. Let $\{X_n\}$ be an iid binary random process with equal probability of
$+1$ or $-1$ occurring at any time $n$. Show that if $Y_n$ is the standardized
sum

$$Y_n = n^{-1/2} \sum_{k=0}^{n-1} X_k,$$

then

$$M_{Y_n}(ju) = e^{n \log \cos(u/\sqrt{n})}.$$

Find the limit of this expression as $n \to \infty$.



8. Suppose that a fair coin is flipped 1,000,000 times. Write an exact
expression for the probability that between 400,000 and 500,000 heads
occur. Next use the central limit theorem to find an approximation
to this probability. Use tables to evaluate the resulting integral.



9. Using an expansion of the form of equation (4.102), show directly
that the central limit theorem is satisfied for a sequence of iid random
variables with pdf

$$p(x) = \frac{2}{\pi (1 + x^2)^2}, \quad x \in \Re.$$

Try to use the same expansion for

$$p(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \Re.$$

Explain your result.



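10. Suppose that $\{X_n\}$ is a weakly stationary random process with a
marginal pdf $f_X(\alpha) = 1$ for $0 < \alpha < 1$ and a covariance function

$$K_X(k) = \frac{1}{12} \rho^{|k|}$$

for all integer $k$ ($\rho < 1$). What is

$$\mathop{\mathrm{l.i.m.}}_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} X_k \, ?$$

What is

$$\mathop{\mathrm{l.i.m.}}_{n \to \infty} \frac{1}{n^2} \sum_{k=0}^{n-1} X_k \, ?$$
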
11. If $\{X_n\}$ is an uncorrelated process with constant first and second
moments, does it follow for an arbitrary function $g$ that

$$n^{-1} \sum_{i=0}^{n-1} g(X_i) \mathop{\longrightarrow}_{n \to \infty} E[g(X)]$$

in mean square? ($E[g(X)]$ denotes the unchanging value of $E[g(X_n)]$.)
Show that it does follow if the process is iid.

12. Apply problem 4.11 to indicator functions to prove that relative
frequencies of order $n$ converge to pmf’s in mean square and in
probability for iid random processes. That is, if $r_a^{(n)}$ is defined as in
the chapter, then $r_a^{(n)} \to p_X(a)$ as $n \to \infty$ in both senses for any $a$
in the range space of $X$.

13. Define the subsets of the real line

$$F_n = \left\{ r : |r| > \frac{1}{n} \right\}, \quad n = 1, 2, \ldots$$

and

$$F = \{0\}.$$

Show that

$$F^c = \bigcup_{n=1}^{\infty} F_n.$$

Use this fact, the Tchebychev inequality, and the continuity of
probability to show that if a random variable $X$ has variance 0, then
$\Pr(|X - EX| > \epsilon) \leq 0$ independent of $\epsilon$ and hence $\Pr(X = EX) = 1$.



14. 



True or False? Given a nonnegative random variable A, for any e > 0 
and a > 0. 



Pr(A > e) < 






15. Show that for a discrete random variable $X$,

$$|E(X)| \leq E(|X|).$$

Repeat for a continuous random variable.

16. This problem considers some useful properties of autocorrelation or
covariance functions.

(a) Use the fact that $E[(X_t - X_s)^2] \geq 0$ to prove that if $EX_t = EX_0$
for all $t$ and $E(X_t^2) = R_X(t,t) = R_X(0,0)$ for all $t$ (that is, if
the mean and variance do not depend on time), then

$$|R_X(t,s)| \leq R_X(0,0)$$

and

$$|K_X(t,s)| \leq K_X(0,0).$$

Thus both functions take on their maximum value when $t = s$.
This can be interpreted as saying that no random variable can
be more correlated with a given random variable than it is with
itself.

(b) Show that autocorrelations and covariance functions are
symmetric functions, e.g., $R_X(t,s) = R_X(s,t)$.

17. The Cauchy-Schwarz Inequality: Given random variables $X$ and $Y$,
define $a = E(X^2)^{1/2}$ and $b = E(Y^2)^{1/2}$. By considering the quantity
$E[(X/a \pm Y/b)^2]$ prove the following inequality:

$$|E(XY)| \leq ab.$$

18. Given two random processes $\{X_t; t \in \mathcal{T}\}$ and $\{Y_t; t \in \mathcal{T}\}$ defined on
the same probability space, the cross correlation function $R_{XY}(t,s)$; $t, s \in \mathcal{T}$
is defined as

$$R_{XY}(t,s) = E(X_t Y_s).$$

Note that the autocorrelation is the special case $R_X(t,s) = R_{XX}(t,s)$.
Show that $R_{XY}$ is not, in general, a
symmetric function of its arguments. Use the Cauchy-Schwarz
inequality of problem 4.17 to find an upper bound to $|R_{XY}(t,s)|$ in terms of the
autocorrelation functions $R_X$ and $R_Y$.

19. Let $\Theta$ be a random variable described by a uniform pdf on $[-\pi, \pi]$ and
let $Y$ be a random variable with mean $m$ and variance $\sigma^2$; assume that
$\Theta$ and $Y$ are independent. Define the random process $\{X(t); t \in \Re\}$
by $X(t) = Y \cos(2\pi f_0 t + \Theta)$, where $f_0$ is a fixed frequency in hertz.
Find the mean and autocorrelation function of this process. Find the
limiting time average

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T X(t) \, dt.$$

(Only in trivial processes such as this can one find exactly such a
limiting time average.)







20. Suppose that $\{X_n\}$ is an iid process with a uniform pdf on $[0, 1)$.
Does $Y_n = X_1 X_2 \cdots X_n$ converge in mean square as $n \to \infty$? If so,
to what?

21. Let $r^{(n)}(a)$ denote the relative frequency of the letter $a$ in a sequence
$x_0, \ldots, x_{n-1}$. Show that if we define $q(a) = r^{(n)}(a)$, then $q(a)$ is a
valid pmf. (This pmf is called the “sample distribution,” or “empirical
distribution.”)

One measure of the distance or difference between two pmf’s $p$ and $q$
is

$$\|p - q\|_1 = \sum_a |p(a) - q(a)|.$$

Show that if the underlying process is iid with marginal pmf $p$, then
the empirical pmf will converge to the true pmf in the sense that

$$\lim_{n \to \infty} \|p - r^{(n)}\|_1 = 0.$$

22. Given two sequences of random variables $\{X_n; n = 1, 2, \ldots\}$ and
$\{Y_n; n = 1, 2, \ldots\}$ and a random variable $X$, suppose that with
probability one $|X_n - X| \leq Y_n$ for all $n$ and that $EY_n \to 0$ as $n \to \infty$.
Prove that $EX_n \to EX$ and that $X_n$ converges to $X$ in probability
as $n \to \infty$.

23. This problem provides another example of the use of covariance
functions. Say that we have a discrete time random process $\{X_n\}$ with
a covariance function $K_X(t,s)$ and a mean function $m_n = EX_n$. Say
that we are told the value of the past sample, say $X_{n-1} = \alpha$, and we
are asked to make a good guess of the next sample on the basis of the
old sample. Furthermore, we are required to make a linear guess or
estimate, called a prediction, of the form

$$\hat{X}_n(\alpha) = a\alpha + b,$$

for some constants $a$ and $b$. Use ordinary calculus techniques to find
the values of $a$ and $b$ that are “best” in the sense of minimizing the
mean squared error

$$E[(X_n - \hat{X}_n(X_{n-1}))^2].$$

Give your answer in terms of the mean and covariance function.
Generalize to a linear prediction of the form

$$\hat{X}_n(X_{n-1}, X_{n-m}) = a_1 X_{n-1} + a_m X_{n-m} + b,$$

where $m$ is an arbitrary integer, $m \geq 2$. When is $a_m = 0$?

24. We developed the mean and variance of the sample average $S_n$ for
the special case of uncorrelated random variables. Evaluate the mean
and variance of $S_n$ for the opposite extreme, where the $X_i$ are highly
correlated in the sense that $E[X_i X_k] = E[X_i^2]$ for all $i, k$.

25. Given $n$ independent random variables $X_i$, $i = 1, 2, \ldots, n$ with
variances $\sigma_i^2$ and means $m_i$. Define the random variable

$$Y = \sum_{i=1}^{n} a_i X_i,$$

where the $a_i$ are fixed real constants. Find the mean, variance, and
characteristic function of $Y$.

Now let the mean be constant; i.e., $m_i = m$. Find the minimum
variance of $Y$ over the choice of the $\{a_i\}$ subject to the constraint
that $EY = m$. The result is called the minimum variance unbiased
estimate of $m$.

Now suppose that $\{X_i; i = 0, 1, \ldots\}$ is an iid random process and
that $N$ is a Poisson random variable with parameter $\lambda$ and that $N$ is
independent of the $\{X_i\}$. Define the random variable

$$Y = \sum_{i=1}^{N} X_i.$$

Use iterated expectation to find the mean, variance, and characteristic
function of $Y$.



26. The random process of example [3.27] can be expressed as follows:
Let $\Theta$ be a continuous random variable with a pdf

$$f_\Theta(\theta) = \frac{1}{2\pi}; \quad \theta \in [-\pi, +\pi]$$

and define the process $\{X(t); t \in \Re\}$ by

$$X(t) = \cos(t + \Theta).$$

(a) Find the cdf $F_{X(0)}(x)$.

(b) Find $EX(t)$.

(c) Find the covariance function $K_X(t,s)$.

27. Let $\{X_n\}$ be a random process with mean $m$ and autocorrelation
function $R_X(n,k)$, and let $\{W_n\}$ be an iid random process with zero mean
and variance $\sigma_W^2$. Assume that the two processes are independent of
each other; that is, any collection of the $X_i$ is independent of any
collection of the $W_i$. Form a new random process $Y_n = X_n + W_n$.
Note: This is a common model for a communication system or
measurement system with $\{X_n\}$ a “signal” process or “source,” $\{W_n\}$ a
“noise” process, and $\{Y_n\}$ the “received” process; see problem 3.30
for example.

(a) Find the mean $EY_n$ and covariance $K_Y(t,s)$ in terms of the given
parameters.

(b) Find the cross-correlation function defined by

$$R_{XY}(k,j) = E[X_k Y_j].$$

(c) As in exercise 4.23, find the minimum mean squared error
estimate of $X_n$ of the form

$$\hat{X}(Y_n) = aY_n + b.$$

The resulting estimate is called a filtered value of $X_n$.

(d) Extend to a linear filtered estimate that uses $Y_n$ and $Y_{n-1}$.

28. Suppose that there are two independent data sources $\{W_i(n); i = 1, 2\}$.
Each data source is modeled as a Bernoulli random process
with parameter 1/2. The two sources are encoded for transmission as
follows: First, three random processes $\{Y_i(n); i = 1, 2, 3\}$ are formed,
where $Y_1 = W_1$, $Y_2 = W_2$, $Y_3 = W_1 + W_2$, and where the last sum is
taken modulo 2 and is formed to provide redundancy for noise
protection in transmission. These are time-multiplexed to form a random
process $\{X(3n + i) = Y_i(n)\}$. Show that $\{X(n)\}$ has identically
distributed components and is pairwise independent but is not iid.

29. Let $\{U_n; n = 0, 1, \ldots\}$ be an iid random process with marginal pdf
$f_{U_n} = f_U$, the uniform pdf of Problem A.1. In other words, the joint
pdf’s can be written as

$$f_{U^n}(u^n) = f_{U_0, \ldots, U_{n-1}}(u_0, \ldots, u_{n-1}) = \prod_{i=0}^{n-1} f_U(u_i).$$

Find the mean $m_n = E[U_n]$ and covariance function $K_U(k,j) = E[(U_k - m_k)(U_j - m_j)]$
for the process and verify that the weak law
of large numbers holds for this process.

30. Let $\{U_n\}$ be an iid process with a uniform marginal pdf on $[0, 1)$.
Define a new process $\{W_n; n = 0, 1, \ldots\}$ by $W_0 = 2U_0$ and
$W_n = U_n + U_{n-1}$ for $n = 1, 2, \ldots$. Find the mean $E[W_n]$ and covariance
function $K_W(k,j)$. Does the weak law of large numbers hold for this
process? Since elementary probability is a prerequisite for this course,
you should be able to find the pdf $f_{W_n}$. Do so.



31. Show that the convergence of the average of the means in (4.103) to a 
constant and convergence of equation (4.104) to zero are sufficient for 
a mean ergodic theorem of the form of theorem 4.11. In what sense 
if any does {Sn} converge? 

32. The purpose of this problem is to demonstrate the relationships among
the four forms of convergence that we have presented. In each case,
$([0, 1], \mathcal{B}([0, 1]), P)$ is the underlying probability space, with
probability measure described by the uniform pdf. For each of the following
sequences of random variables, determine the pmf of $\{Y_n\}$, the senses
in which the sequence converges, and the random variable and pmf
to which the sequence converges.

(a) $$Y_n(\omega) = \begin{cases} 1 & \text{if } n \text{ is odd and } \omega < 1/2 \text{ or } n \text{ is even and } \omega > 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

(b) $$Y_n(\omega) = \begin{cases} 1 & \text{if } \omega < 1/n \\ 0 & \text{otherwise.} \end{cases}$$

(c) $$Y_n(\omega) = \begin{cases} n & \text{if } \omega < 1/n \\ 0 & \text{otherwise.} \end{cases}$$

(d) Divide $[0, 1]$ into a sequence of intervals $\{F_n\} = \{[0, 1], [0, 1/2),
[1/2, 1], [0, 1/3), [1/3, 2/3), [2/3, 1], [0, 1/4), \ldots\}$. Let

$$Y_n(\omega) = \begin{cases} 1 & \text{if } \omega \in F_n \\ 0 & \text{otherwise.} \end{cases}$$

(e) $$Y_n(\omega) = \begin{cases} 1 & \text{if } \omega < 1/2 + 1/n \\ 0 & \text{otherwise.} \end{cases}$$

33. Suppose that $X$ is a random variable with mean $m$ and variance
$\sigma^2$. Let $g_k$ be a deterministic periodic pulse train such that $g_k$ is 1
whenever $k$ is a multiple of a fixed positive integer $N$ and $g_k$ is 0 for
all other $k$. Let $U$ be a random variable that is independent of $X$
such that $p_U(u) = 1/N$ for $u = 0, 1, \ldots, N - 1$. Define the random
process $Y_n$ by

$$Y_n = X g_{U+n};$$

that is, $Y_n$ looks like a periodic pulse train with a randomly selected
amplitude and a randomly selected phase. Find the mean and
covariance functions of the $Y$ process. Find a random variable $Y$ such
that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} Y_i = Y$$

in the sense of convergence with probability one. (This is an
example of a process that is simple enough for the limit to be evaluated
explicitly.) Under what conditions on the distribution of $X$ does the
limit equal $EY_0$ (and hence the conclusion of the weak law of large
numbers holds for this process with memory)?

34. Let $\{X_n\}$ be an iid zero-mean Gaussian random process with
autocorrelation function $R_X(0) = \sigma^2$. Let $\{U_n\}$ be an iid random process
with $\Pr(U_n = 1) = \Pr(U_n = -1) = 1/2$. Assume that the two
processes are mutually independent of each other. Define a new random
process $\{Y_n\}$ by

$$Y_n = U_n X_n.$$

(a) Find the autocorrelation function $R_Y(k,j)$.

(b) Find the characteristic function $M_{Y_n}(ju)$.

(c) Is $\{Y_n\}$ an iid process?

(d) Does the sample average

$$S_n = n^{-1} \sum_{i=0}^{n-1} Y_i$$

converge in mean square? If so, to what?

35. Assume that $\{X_n\}$ is an iid zero-mean Gaussian random process with
$R_X(0) = \sigma^2$, that $\{U_n\}$ is an iid binary random process with
$\Pr(U_n = 1) = 1 - \epsilon$ and $\Pr(U_n = 0) = \epsilon$ (in other words, $\{U_n\}$ is a Bernoulli
process with parameter $1 - \epsilon$), and the processes $\{X_n\}$ and $\{U_n\}$ are
mutually independent of each other. Define a new random process

$$V_n = X_n U_n.$$

(This is a model for the output of a communication channel that has
the $X$ process as an input but has “dropouts,” that is, occasionally
sets an input symbol to zero.)

(a) Find the mean $EV_n$ and characteristic function $M_{V_n}(ju) = Ee^{juV_n}$.

(b) Find the mean squared error $E[(X_n - V_n)^2]$.

(c) Find $\Pr(X_n \neq V_n)$.

(d) Find the covariance of $V_n$.

(e) Is the following true? 




36. Show that convergence in distribution is implied by the other three 
forms of convergence. 

37. Let $\{X_n\}$ be a finite-alphabet iid random process with marginal pmf
$p_X$. The entropy of an iid random process is defined as

$$H(X) = -\sum_x p_X(x) \log p_X(x) = E(-\log p_X(X)),$$

where care must be taken to distinguish the use of the symbol $X$ to
mean the name of the random variable in $H(X)$ and $p_X$ and its use as
the random variable itself in the argument of $p_X(X)$ in the final expression.
If the logarithm is base two then the units of entropy are called bits.
Use the weak law of large numbers to show that

$$-\frac{1}{n} \sum_{i=0}^{n-1} \log p_X(X_i) \mathop{\longrightarrow}_{n \to \infty} H(X)$$

in the sense of convergence in probability. Show that this implies that

$$\lim_{n \to \infty} \Pr\left(\left|p_{X_0, \ldots, X_{n-1}}(X_0, \ldots, X_{n-1}) - 2^{-nH(X)}\right| > \epsilon\right) = 0$$

for any $\epsilon > 0$. This result was first developed by Claude Shannon and
is sometimes called the asymptotic equipartition property of
information theory. It forms one of the fundamental results of the
mathematical theory of communication. Roughly stated, with high probability
an iid process will produce for large $n$ an $n$-dimensional sample
vector $x^n = (x_0, x_1, \ldots, x_{n-1})$ such that the $n$th order probability
mass function evaluated at $x^n$ is approximately $2^{-nH(X)}$; that is, the
process produces long vectors that appear to have an approximately
uniform distribution over some collection of possible vectors.

38. Suppose that $\{X_n\}$ is a discrete time iid random process with uniform
marginal pdf’s

$$f_{X_n}(\alpha) = \begin{cases} 1 & 0 \leq \alpha < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Does the sequence of random variables

$$Z_n = \prod_{i=1}^{n} X_i$$

converge in probability? If so, to what?

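39. The conditional differential entropy of $X_{n-1}$ given $X^{n-1} = (X_0, X_1, \ldots, X_{n-2})$
is defined by

$$h(X_{n-1} | X^{n-1}) = -\int f_{X_0, X_1, \ldots, X_{n-1}}(x_0, x_1, \ldots, x_{n-1}) \times \log f_{X_{n-1} | X_0, \ldots, X_{n-2}}(x_{n-1} | x_0, \ldots, x_{n-2}) \, dx_0 \, dx_1 \cdots dx_{n-1}. \tag{4.136}$$

Show that

$$h(X^n) = h(X_{n-1} | X^{n-1}) + h(X^{n-1}). \tag{4.137}$$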

Now suppose that $\{X_n\}$ is a stationary Gaussian random process with
zero mean and covariance function $K$. Evaluate $h(X_n | X^n)$.

40. Let $X \geq 0$ be an integer valued random variable with $E(X) < \infty$.

(a) Prove that

$$E(X) = \sum_{k=1}^{\infty} P(X \geq k).$$

(b) Based on (a) argue that

$$\lim_{N \to \infty} P(X \geq N) = 0.$$

(c) Prove the stronger statement

$$P(X \geq N) \leq \frac{E(X)}{N}.$$

Hint: Write an expression for the expectation $E(X)$ and break
up the sum into two parts, a portion where the summation
dummy variable is larger than $N$ and a portion where it is
smaller. A simple lower bound for each part gives the desired
result.

(d) Let $X$ be a geometric random variable with parameter $p$, $p \neq 0$.
Calculate the quantity $P(X \geq N)$ and use this result to show
that actually $\lim_{N \to \infty} P(X \geq N) = 0$.

(e) Based on the previous parts show that 







for any 0 < p < 1 and for any integer N . 



41. Suppose that $\{X_n\}$ is an iid random process with mean $E(X_n) = \bar{X}$
and variance $E[(X_n - \bar{X})^2] = \sigma^2$. A new process $\{Y_n\}$ is defined by
the relation

$$Y_n = \sum_{k=0}^{\infty} r^k X_{n-k},$$

where $|r| < 1$. Find $E(Y_n)$ and the autocorrelation $R_Y(k,j)$ and the
covariance $K_Y(k,j)$.
Define the sample average

$$S_n = \frac{1}{n} \sum_{i=0}^{n-1} Y_i.$$

Find the mean $E(S_n)$ and variance $\sigma^2_{S_n}$. Does $S_n \to 0$ in probability?



42. Let $\{U_n\}$ be an iid Gaussian random process with mean 0 and variance
$\sigma^2$. Suppose that $Z$ is a random variable having a uniform distribution
on $[0, 1]$. Suppose $Z$ represents the value of a measurement taken by
a remote sensor and that we wish to guess the value of $Z$ based on
a noisy sequence of measurements $Y_n = Z + U_n$, $n = 0, 1, 2, \ldots$; that
is, we observe only $Y_n$ and wish to estimate the underlying value of
$Z$. To do this we form a sample average and define the estimate

$$\hat{Z}_n = \frac{1}{n} \sum_{i=0}^{n-1} Y_i.$$

(a) Find a simple upper bound to the probability

$$\Pr(|\hat{Z}_n - Z| > \epsilon)$$

that goes to zero as $n \to \infty$. (This means that our estimator is
asymptotically good.)

Suppose next that we have a two-dimensional random process
$\{U_n, W_n\}$ (i.e., the output at each time is a random pair or
a two-dimensional random variable) with the following
properties: Each pair $(U_n, W_n)$ is independent of all past and
future pairs $(U_k, W_k)$, $k \neq n$. Each pair $(U_n, W_n)$ has an
identical joint cdf $F_{U,W}(u,w)$. For each $n$, $EU_n = EW_n = 0$,
$E(U_n^2) = E(W_n^2) = \sigma^2$, and $E(U_n W_n) = \rho\sigma^2$. (The quantity
$\rho$ is called the correlation coefficient.) Instead of just observing
a noisy sequence $Y_n = Z + U_n$, we also observe a separate noisy
measurement sequence $X_n = Z + W_n$ (the same $Z$, but different
noises). Suppose further that we try to improve our estimate of
$Z$ by using both of these measurements to form an estimate

$$\hat{Z}_n = a \frac{1}{n} \sum_{i=0}^{n-1} Y_i + (1 - a) \frac{1}{n} \sum_{i=0}^{n-1} X_i$$

for some $a$ in $[0, 1]$.

(b) Show that $|\rho| \leq 1$. Find a simple upper bound to the probability

$$\Pr(|\hat{Z}_n - Z| > \epsilon)$$

that goes to zero as $n \to \infty$. What value of $a$ gives the smallest
upper bound in part (b) and what is the resulting bound? (Note
as a check that the bound should be no worse than part (a) since
the estimator of part (a) is a special case of that of part (b).) In
the special case where $\rho = -1$, what is the best $a$ and what is
the resulting bound?

43. Suppose that $\{X_n\}$ are iid random variables described by a common
marginal distribution $F$. Suppose that the random variables

$$S_n = \frac{1}{n} \sum_{i=0}^{n-1} X_i$$

also have the distribution $F$ for all positive integers $n$. Find the form
of the distribution $F$. (This is an example of what is called a stable
distribution.) Suppose that the $1/n$ in the definition of $S_n$ is replaced
by $1/\sqrt{n}$. What must $F$ then be?

44. Consider the following nonlinear modulation scheme: Define

$$W(t) = e^{j(2\pi f_0 t + cX(t) + \Theta)},$$

where $\{X(t)\}$ is a weakly stationary Gaussian random process with
autocorrelation function $R_X(\tau)$, $f_0$ is a fixed frequency, $\Theta$ is a uniform
random variable on $[0, 2\pi]$, $\Theta$ is independent of all of the $X(t)$, and
$c$ is a modulation constant. (This is a mathematical model for phase
modulation.)

(Define the expectation of a complex random variable in the natural
way, that is, if $Z = \Re(Z) + j\Im(Z)$, then $E(Z) = E[\Re(Z)] + jE[\Im(Z)]$.)
Define the autocorrelation of a complex valued random process $W(t)$
by

$$R_W(t,s) = E(W(t)W(s)^*),$$

where $W(s)^*$ denotes the complex conjugate of $W(s)$.

Find the mean $E(W(t))$ and the autocorrelation function $R_W(t,s) = E[W(t)W(s)^*]$.

Hint: The autocorrelation is admittedly a trick question (but a very
useful trick). Keep characteristic functions in mind when pondering
the evaluation of the autocorrelation function.

45. Suppose that $\{X_n; n = 0, 1, \cdots\}$ is a discrete time iid random process
with pmf

$$p_{X_n}(k) = 1/2; \quad k = 0, 1.$$

Two other random processes are defined in terms of the $X$ process:

$$Y_n = \sum_{i=0}^{n} X_i; \quad n = 0, 1, \cdots$$

$$W_n = (-1)^{Y_n}; \quad n = 0, 1, \cdots$$

and 

In — X.yi Xji—i^ n — 1, . . . . 

(a) Find the covariance functions for the $X$ and $Y$ processes.

(b) Find the mean and variance of the random variable $W_n$. Find
the covariance function of the process $W_n$.

(c) Find the characteristic function of the random variable $V_n$.







(d) Which of the above four processes are weakly stationary? Which 
are not? 

(e) Evaluate the following limits: 

i. l.i.m.^ — ^oo ■ 

ii. l.i.m.„^oo 

hi. l.i.m.„^oo i Ya=i 

iv. For the showoffs: Does the last limit above converge with 
probability one? (Only elementary arguments are needed.) 



46. Suppose that $\{X_n\}$ is a discrete time iid random process with uniform
marginal pdf’s

$$f_{X_n}(\alpha) = \begin{cases} 1 & 0 \leq \alpha < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Define the following random variables: 



• u = xl 

• $V = \max(X_1, X_2, X_3, X_4)$







if > 2 X 2 
otherwise 



• For each integer $n$

$$Y_n = X_n + X_{n-1}.$$

Note that this defines a new random process $\{Y_n\}$.



(a) Find the expected values of the random variables $U$, $V$, and $W$.

(b) What are the mean $E(X_n)$ and covariance function $K_X(k,j)$ of
$\{X_n\}$?

(c) What are the mean $E(Y_n)$ and covariance function $K_Y(k,j)$ of
$\{Y_n\}$?
(d) Define the sample average 




Find the mean $E(S_n)$ and variance $\sigma^2_{S_n}$ of $S_n$. Using only these
results (and no results not yet covered in class), find $\mathop{\mathrm{l.i.m.}}_{n \to \infty} S_n$.

(e) Does the sequence of random variables

$$Z_n = \prod_{i=1}^{n} X_i$$

converge in probability to 0?

47. A discrete time martingale $\{Y_n;\ n = 0, 1, 2, \ldots\}$ is a process
with the property that

$$E[Y_n | Y_0, Y_1, \ldots, Y_{n-1}] = Y_{n-1}.$$

In words, the conditional expectation of the current value given the past
is the previous value. Suppose that $\{X_n\}$ is iid. Is

$$Y_n = \sum_{k=0}^{n-1} X_k$$

a martingale?

48. Let {Yn} be the one-dimensional random walk of chapter 3. 

(a) Find the pmf $p_{Y_n}$ for $n = 0, 1, 2$.

(b) Find the mean $E[Y_n]$ and variance $\sigma_{Y_n}^2$.

(c) Does $Y_n/n$ converge as $n$ gets large?

(d) Find the conditional pmf’s $p_{Y_n|Y_0,Y_1,\ldots,Y_{n-1}}(y_n|y_0, y_1, \ldots, y_{n-1})$
and $p_{Y_n|Y_{n-1}}(y_n|y_{n-1})$. Is this process Markov?

(e) What is the minimum MSE estimate of $Y_n$ given $Y_{n-1}$? What is
the probability that $Y_n$ actually equals its minimum MSE
estimate?

49. Let $\{X_n\}$ be a binary iid process with $p_X(\pm 1) = 0.5$. Define a new
process $\{W_n;\ n = 0, 1, \ldots\}$ by

$$W_n = X_n + X_{n-1}.$$

This is an example of a moving average process, so-called because it
computes a short term average of the input process. Find the mean,
variance, and covariance function of $\{W_n\}$. Prove a weak law of large
numbers for $W_n$.

50. How does one generate a random process? It is often of interest to
do so in order to simulate a physical system in order to test an al- 
gorithm before it is applied to genuine data. Using genuine physical 
data may be too expensive, dangerous, or politically risky. One might 
connect a sensor to a resistor and heat it up to produce thermal noise, 
or flip a coin a few million times. One solution requires uncommon 
hardware and the other physical effort. The usual solution is to use 
a computer to generate a sequence that is not actually random, but 
pseudo random in that it can produce a long sequence of numbers 
that appear to be random and which will satisfy several tests for ran- 
domness, provided that the tests are not too stringent. An example is 
the rand command used in Matlab™. It uses the linear congruential
method which starts with a “seed” $X_0$ and then recursively defines
the sequence

$$X_n = (7^5 X_{n-1}) \bmod (2^{31} - 1).$$

This produces a sequence of integers in the range from 0 to $2^{31} - 1$.
Dividing by $2^{31}$ (which is just a question of shifting in binary
arithmetic) produces a number in the range [0, 1). Find a computer
with Matlab or program this algorithm yourself and try it out with 
different starting sequences. Find the sample average of a sequence 
of 100, 1000, and 10000 samples and compare them to the expected 
value of the uniform pdf random variable considered in this chapter. 
How might you determine whether or not the sequence being viewed 
was indeed random or not if you did not know how it was generated? 

51. Suppose that $U$ is a random variable with pdf $f_U(u) = 1$ for $u \in [0, 1)$.
Describe a function $q : [0, 1) \to A$, where $A = \{0, 1, \ldots, K - 1\}$, so
that the random variable $X = q(U)$ is discrete with pmf

$$p_X(k) = \frac{1}{K}; \quad k = 0, 1, \ldots, K - 1.$$

You have produced a uniform discrete random variable from a uniform
continuous random variable.

(a) What is the minimum mean squared error estimator of $U$ given
$X = k$? Call this estimator $\hat{U}(k)$. Write an expression for the
resulting MSE

$$E[(U - \hat{U}(q(U)))^2].$$

(b) Show that the estimator $\hat{U}$ found in the previous part minimizes
the MSE $E[(U - \hat{U}(q(U)))^2]$ between the original input and the
final output (assuming that $q$ is fixed). You have just demonstrated
one of the key properties of a Lloyd-Max quantizer.

(c) Find the pmf for the random variable $\hat{U} = \hat{U}(q(U))$. Find $E[\hat{U}]$
and $\sigma_{\hat{U}}^2$. How do the mean and variance of $\hat{U}$ compare with
those of $U$ (i.e., equal, bigger, smaller)?

52. Modify the development in the text for the minimum mean squared 
error estimator to work for discrete random variables. What is the 
minimum MSE estimator for $Y_n$ given $Y_{n-1}$ for the binary Markov
process developed in the chapter? Which do you think makes more 
sense for guessing the next outcome for a binary Markov process, the 
minimum probability of error classifier or the minimum MSE estima- 
tor? Explain. 

53. Let $\{Y_n;\ n = 0, 1, \ldots\}$ be the binary Markov process developed in
the chapter. Find a new process $\{W_n;\ n = 1, 2, \ldots\}$ defined by
$W_n = Y_n \oplus Y_{n-1}$. Describe the process $W_n$.

54. (Problem courtesy of the ECE Department of the Technion.) Let X 
be a Gaussian random variable with zero mean and variance $\sigma^2$.

(a) Find $E[\cos(nX)]$, $n = 1, 2, \ldots$.

(b) Find $E[X^n]$, $n = 1, 2, \ldots$.

(c) Let $N$ be a Poisson random variable with parameter $\lambda$ and
assume that $X$ and $N$ are independent. Find $E[X^N]$.

Hint: Use characteristic functions and iterated expectation.

55. (Problem courtesy of the ECE Department of the Technion.) Let 
X be a random variable with uniform pdf on [—1, 1]. Define a new 
random variable Y by 



$$Y = \begin{cases} X & X \le 0 \\ 1 & X > 0. \end{cases}$$

(a) Find the cdf $F_Y(y)$ and plot it.

(b) Find the pdf $f_Y(y)$.

(c) Find $E(Y)$ and $\sigma_Y^2$.

(d) Find $E(X|Y)$.

(e) Find $E[(X - E(X|Y))^2]$.

56. (Problem courtesy of the ECE Department of the Technion.) Let 
$X_1, X_2, \ldots, X_n$ be zero mean statistically independent random variables.
Define

$$Y_n = \sum_{i=1}^{n} X_i.$$

Find $E(Y_7 | Y_1, Y_2, Y_3)$.

57. (Problem courtesy of the ECE Department of the Technion.) Let U 
denote a binary random variable with pmf $p_U(u) = .5$ for $u = \pm 1$. Let
$Y = U + X$, where $X$ is $\mathcal{N}(0, \sigma^2)$ and where $U$ and $X$ are independent.
Find $E(U|Y)$.

58. (Problem courtesy of the ECE Department of the Technion.) Let 
$\{X_n;\ n = 1, 2, \ldots\}$ be an iid sequence with mean 0 and unit variance.
Let $K$ be a discrete random variable, independent of the $X_n$, which
has a uniform pmf on $\{1, 2, \ldots, 16\}$. Define

n 

(a) Find $E(Y)$ and $\sigma_Y^2$.

(b) Find the optimal linear estimator in the MSE sense of Xi given 

Y and calculate the resulting MSE. 

(c) Find the optimal linear estimator in the MSE sense of K given 

Y and calculate the resulting MSE. 

59. (Problem courtesy of the ECE Department of the Technion.) Let 
$Y$, $N_1$, $N_2$ be zero mean, unit variance, mutually independent random
variables. Define

$$X_1 = Y + N_1 + \sqrt{\alpha}\, N_2$$

$$X_2 = Y + 3N_1 + \sqrt{\alpha}\, N_2.$$

(a) Find the linear MMSE estimator of $Y$ given $X_1$ and $X_2$.

(b) Find the resulting MSE.

(c) For what value of $\alpha \in [0, \infty)$ does the mean squared error become
zero? Provide an intuitive explanation. 

60. (Problem courtesy of the ECE Department of the Technion.) Let 
$\{X_n;\ n = 1, 2, \ldots\}$ be an iid sequence of $\mathcal{N}(m, \sigma^2)$ random variables.
Define for any positive integer $N$

$$S_N = \sum_{n=1}^{N} X_n.$$

(a) For $K < N$ find the pdf $f_{S_N, S_K}(\alpha, \beta)$.

(b) Find the MMSE estimator of $S_K$ given $S_N$, $E(S_K | S_N)$. Define
$V_K = \sum_{n=1}^{K} X_n^2$. Find the MMSE of $V_K$ given $V_N$.

61. (Problem courtesy of the ECE Department of the Technion.) Let 
$X_i = S + W_i$, $i = 1, 2, \ldots, N$, where $S$ and the $W_i$ are mutually
independent with zero mean. The variance of $S$ is $\sigma_S^2$ and the variances
of all the $W_i$ are all $\sigma_W^2$.

(a) Find the linear MMSE of $S$ given the observations $X_i$, $i =
1, 2, \ldots, N$.

(b) Find the resulting MSE. 
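
The linear congruential recursion of problem 50 is easy to try out. The following is a minimal Python sketch, assuming the commonly cited multiplier $7^5 = 16807$ and modulus $2^{31} - 1$; the function names `lcg_uniform` and `sample_average` and the seed value are illustrative choices, not part of the text.

```python
def lcg_uniform(seed, count, multiplier=7**5, modulus=2**31 - 1):
    """Return `count` pseudo-random numbers in [0, 1) from a linear congruential recursion."""
    x = seed
    samples = []
    for _ in range(count):
        x = (multiplier * x) % modulus   # X_n = (a X_{n-1}) mod m
        samples.append(x / 2**31)        # scale the integer into [0, 1)
    return samples


def sample_average(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Compare sample averages with the mean 1/2 of the uniform pdf on [0, 1).
for n in (100, 1000, 10000):
    print(n, sample_average(lcg_uniform(seed=12345, count=n)))
```

Changing the seed changes the whole sequence, but the recursion is deterministic: the same seed always reproduces the same numbers.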
Chapter 5

Second-Order Moments



In chapter 4 we have seen that the second-order moments of a random 
process — the mean and covariance or, equivalently, the autocorrelation 
— play a fundamental role in describing the relation of limiting sample 
averages and expectations. We have also seen, e.g., in Section 4.5.1 and 
problem 4.23, and we shall see again that these moments also play a key role 
in signal processing applications of random processes, linear least squares 
estimation in particular. Because of the fundamental importance of these 
particular moments, this chapter considers their properties in greater depth 
and their evaluation for several important examples. A primary focus is on 
a second-order moment analog of a derived distribution problem: suppose 
that we are given the second-order moments of one random process and 
that this process is then used as an input to a linear system; what are 
the resulting second-order moments of the output random process? These 
results are collectively known as second-order moment input/output or I/O 
relations for linear systems. 

Linear systems may seem to be a very special case. As we will see, 
their most obvious attribute is that they are easier to handle analytically, 
which leads to more complete, useful, and stronger results than can be 
obtained for the class of all systems. This special case, however, plays a 
central role and is by far the most important class of systems. The design of 
engineering systems frequently involves the determination of an optimum 
system — perhaps the optimum signal detector for a signal in noise, the 
filter that provides the highest signal-to-noise ratio, the optimum receiver, 
etc. Surprisingly enough, the optimum is frequently a linear system. Even 
when the optimum is not linear, often a linear system is a good enough 
approximation to the optimal system so that a linear system is used for 
the sake of economical design. For these reasons it is of interest to study
the properties of the output random process from a linear system that is
driven by a specified input random process. In this chapter we consider only
second-order moments; in the next chapter we consider examples in which
one can develop a more complete probabilistic description of the output
process. As one might suspect, the less complete second-order descriptions
are possible under far more general conditions.

With the knowledge of the second-order properties of the output process 
when a linear system is driven by a given random process, one will have the 
fundamental tools for the analysis and optimization of such linear systems. 
As an example of such analysis, the chapter closes with an application of 
second-order moment theory to the design of systems for linear least squares 
estimation. 

Because the primary engineering application of these systems is to noise 
discrimination, we will group them together under the name “linear filters.” 
This designation denotes the suppression or “filtering out” of noise from the 
combination of signal and noise. The methods of analysis are not limited 
to this application, of course. 

As usual, we emphasize discrete time in the development, with the ob- 
vious extensions to continuous time provided by integrals. Furthermore, we 
restrict attention in the basic development to linear time-invariant filters. 
The extension to time-varying systems is obvious but cluttered with ob- 
fuscating notation. Time-varying systems will be encountered briefly when 
considering recursive estimation. 



5.1 Linear Filtering of Random Processes 

Suppose that a random process $\{X(t);\ t \in \mathcal{T}\}$ (or $\{X_t;\ t \in \mathcal{T}\}$) is used
as an input to a linear time-invariant system described by a $\delta$ response $h$.
Hence the output process, say $\{Y(t)\}$ or $\{Y_t\}$, is described by the convolution
integral of (A.22) in the continuous time case or the convolution sum of
(A.29) in the discrete time case. To be precise, we have to be careful about
how the integral or sum is defined; that is, integrals and infinite sums of
random processes are really limits of random variables, and those limits can
converge in a variety of ways, such as quadratic mean or with probability
one. For the moment we will assume that the convergence is pointwise
(that is, with probability one), i.e., that each realization or sample function
of the output is related to the corresponding realization of the input via
(A.22) or (A.29). That is, we take



$$Y(t) = \int_{s:\, t-s \in \mathcal{T}} X(t - s) h(s)\, ds \tag{5.1}$$

or

$$Y_n = \sum_{k:\, n-k \in \mathcal{T}} X_{n-k} h_k \tag{5.2}$$

to mean actual equality for all elementary events $\omega$ on the underlying
probability space $\Omega$. More precisely,

$$Y(t, \omega) = \int_{s:\, t-s \in \mathcal{T}} X(t - s, \omega) h(s)\, ds$$

or

$$Y_n(\omega) = \sum_{k:\, n-k \in \mathcal{T}} X_{n-k}(\omega) h_k,$$


respectively. Rigorous consideration of conditions under which the various 
limits exist is straightforward for the discrete time case. It is obvious that 
the limits exist for the so-called finite impulse response (FIR) discrete time 
filters where only a finite number of the hk are nonzero and hence the 
sum has only a finite number of terms. It is also possible to show mean 
square convergence for the general discrete time convolution if the input 
process has finite mean and variance and if the filter is stable in the sense 
of (A. 30). In particular, for a two-sided input process, (5.2) converges in 
quadratic mean; i.e., 

$$\mathop{\mathrm{l.i.m.}}_{N \to \infty} \sum_{k=0}^{N-1} X_{n-k} h_k$$

exists for all n. Convergence with probability 1 can be established using 
more advanced methods provided sufficient technical conditions are satis- 
fied. The theory is far more complicated in the continuous time case. As 
usual, we will by and large ignore these problems and just assume that the 
convolutions are well defined. 
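
For a one-sided input and an FIR filter, the convolution sum (5.2) has finitely many terms and can be applied directly to a single realization. The following Python sketch assumes a one-sided input indexed from 0; the function name `fir_filter`, the filter taps, and the Gaussian input are hypothetical choices made only for illustration.

```python
import random


def fir_filter(x, h):
    """Y_n = sum_k h_k X_{n-k}, summing only over k with n - k >= 0 (one-sided input)."""
    y = []
    for n in range(len(x)):
        y.append(sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0))
    return y


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(8)]   # one realization of an iid input
h = [0.5, 0.3, 0.2]                              # hypothetical FIR delta response
print(fir_filter(x, h))
```

Feeding in a unit pulse returns the filter taps themselves, which is a quick sanity check on the indexing.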

Unfortunately, (A. 24) and (A. 30) are not satisfied in general for sample 
functions of interesting random processes and hence in general one cannot 
take Fourier transforms of both sides of (5.1) and (5.2) and obtain a useful 
spectral relation. Even if one could, the Fourier transform of a random 
process would be a random variable for each value of frequency! Because 
of this, the frequency domain theory for random processes is quite different 
from that for deterministic processes. Relations such as (A. 26) may on 
occasion be useful for intuition, but they must be used with extreme care. 

With the foregoing notation and preliminary considerations, we now 
turn to the analysis of discrete time linear filters with random process 
inputs.

5.2 Second-Order Linear Systems I/O Relations

Discrete Time Systems 

Ideally one would like to have a complete specification of the output of a 
linear system as a function of the specification of the input random pro- 
cess. Usually this is a difficult proposition because of the complexity of 
the computations required. However, it is a relatively easy task to deter- 
mine the mean and covariance function at the output. As we will show, 
the output mean and covariance function depend only on the input mean 
and covariance function and on no other properties of the input random 
process. Furthermore, in many, if not most, applications, the mean and 
covariance functions of the output are all that are needed to solve the prob- 
lem at hand. As an important example: if the random process is Gaussian, 
then the mean and covariance functions provide a complete description of 
the process. 

Linear filter input/output (I/O) relations are most easily developed using
the convolution representation of a linear system. Let $\{X_n\}$ be a discrete
time random process with mean function $m_n = EX_n$ and covariance
function $K_X(n, k) = E[(X_n - m_n)(X_k - m_k)]$. Let $\{h_k\}$ be the Kronecker
$\delta$ response of a discrete time linear filter. For notational convenience we
assume that the $\delta$ response is causal. The non-causal case simply involves a
change of the limits of summation. Next we will find the mean and covariance
functions for the output process $\{Y_n\}$ that is given in the convolution
equation of (5.2).

From (5.2) the mean of the output process is found using the linearity 
of expectation as 



$$EY_n = \sum_k h_k EX_{n-k} = \sum_k h_k m_{n-k}, \tag{5.3}$$

assuming, of course, that the sum exists. The sum does exist if the filter
is stable and the input mean is bounded. That is, if there is a constant
$m < \infty$ such that $|m_n| \le |m|$ for all $n$ and if the filter is stable in the sense
of equation (A.30), then

$$|EY_n| = \left| \sum_k h_k m_{n-k} \right| \le \max_k |m_{n-k}| \sum_k |h_k| \le |m| \sum_k |h_k| < \infty.$$


If the input process $\{X_n\}$ is weakly stationary, then the input mean function
equals the constant, $m$, and

$$EY_n = m \sum_k h_k, \tag{5.4}$$

which is the dc response of the filter times the mean. For reference we
specify the precise limits for the two-sided random process where $\mathcal{T} = \mathcal{Z}$
and for the one-sided input random process where $\mathcal{T} = \mathcal{Z}_+$:

$$EY_n = m \sum_{k=0}^{\infty} h_k; \quad \mathcal{T} = \mathcal{Z} \tag{5.5}$$

$$EY_n = m \sum_{k=0}^{n} h_k; \quad \mathcal{T} = \mathcal{Z}_+. \tag{5.6}$$



Thus, for weakly stationary input random processes, the output mean
exists if the input mean is finite and the filter is stable. In addition, it
can be seen that for two-sided weakly stationary random processes, the
expected value of the output process does not depend on the time index
$n$ since then the limits of the summation do not depend on $n$. For one-sided
weakly stationary random processes, however, the output mean is
not constant with time but approaches a constant value as $n \to \infty$ if the
filter is stable. Note that this means that if a one-sided stationary process
is put into a linear filter, the output is in general not stationary!

If the filter is not stable, the magnitude of the output mean is unbounded
with time. For example, if we set $h_k = 1$ for all $k$ in (5.6) then $EY_n =
(n + 1)m$, which very strongly depends on the time index $n$ and which is
unbounded.
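
The contrast between a stable and an unstable filter in the one-sided case can be seen numerically. The sketch below evaluates the one-sided output mean $EY_n = m \sum_{k=0}^{n} h_k$; the function name `output_mean`, the mean value, and the geometric filter $h_k = 0.5^k$ are assumptions chosen for illustration, while $h_k = 1$ is the unstable example from the text.

```python
def output_mean(h, m, n):
    """E[Y_n] = m * sum_{k=0}^{n} h_k for a causal filter and a one-sided stationary input."""
    return m * sum(h(k) for k in range(n + 1))


m = 2.0
stable = lambda k: 0.5 ** k      # hypothetical stable filter: sum of |h_k| equals 2
unstable = lambda k: 1.0         # h_k = 1 for all k

# The stable mean settles toward m * 2 = 4; the unstable mean grows as (n + 1) m.
for n in (0, 1, 5, 50):
    print(n, output_mean(stable, m, n), output_mean(unstable, m, n))
```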

Turning to the calculation of the output covariance function, we use 
equations (5.2) and (5.3) to evaluate the covariance with some bookkeeping 
as 



$$\begin{aligned}
K_Y(k, j) &= E[(Y_k - EY_k)(Y_j - EY_j)] \\
&= E\left[ \left( \sum_n h_n (X_{k-n} - m_{k-n}) \right) \left( \sum_m h_m (X_{j-m} - m_{j-m}) \right) \right] \\
&= \sum_n \sum_m h_n h_m E[(X_{k-n} - m_{k-n})(X_{j-m} - m_{j-m})] \\
&= \sum_n \sum_m h_n h_m K_X(k - n, j - m).
\end{aligned} \tag{5.7}$$

A careful reader might note the similarity between (5.7) and the corresponding
matrix equation (4.28) derived during the consideration of Gaussian
vectors (but true generally for covariance matrices of linear functions
of random vectors).

As before, the range of the sums depends on the index set used. Since
we have specified causal filters, the sums run from 0 to $\infty$ for two-sided
processes and from 0 to $k$ and 0 to $j$ for one-sided random processes.

It can be shown that the sum of (5.7) converges if the filter is stable
in the sense of (A.30) and if the input process has bounded variance; i.e.,
there is a constant $\sigma^2 < \infty$ such that $|K_X(n, n)| < \sigma^2$ for all $n$ (problem
5.19).
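
For a one-sided input and a causal filter, the double sum (5.7) is finite and can be evaluated directly. The following Python sketch does so; the function name `output_covariance`, the nonstationary input covariance, and the filter $h_n = 0.5^n$ are hypothetical choices for illustration only.

```python
def output_covariance(h, K_X, k, j):
    """K_Y(k, j) = sum_{n=0}^{k} sum_{m=0}^{j} h_n h_m K_X(k - n, j - m) (one-sided, causal)."""
    total = 0.0
    for n in range(k + 1):
        for m in range(j + 1):
            total += h(n) * h(m) * K_X(k - n, j - m)
    return total


# Hypothetical input: uncorrelated, with a variance that grows with time.
K_X = lambda a, b: (1.0 + a) if a == b else 0.0
h = lambda n: 0.5 ** n

# Print the 3 x 3 corner of the output covariance function.
print([[output_covariance(h, K_X, k, j) for j in range(3)] for k in range(3)])
```

Only the input mean and covariance enter the computation, in line with the observation that no other property of the input is needed.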

If the input process is weakly stationary, then $K_X$ depends only on the
difference of its arguments. This is made explicit by replacing $K_X(m, n)$
by $K_X(m - n)$. Then (5.7) becomes

$$K_Y(k, j) = \sum_n \sum_m h_n h_m K_X((k - j) - (n - m)). \tag{5.8}$$

Specifying the limits of the summation for the one-sided and two-sided
cases, we have that

$$K_Y(k, j) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} h_n h_m K_X((k - j) - (n - m)); \quad \mathcal{T} = \mathcal{Z} \tag{5.9}$$

and

$$K_Y(k, j) = \sum_{n=0}^{k} \sum_{m=0}^{j} h_n h_m K_X((k - j) - (n - m)); \quad \mathcal{T} = \mathcal{Z}_+. \tag{5.10}$$



If the sum of (5.9) converges (e.g., if the filter is stable and $K_X(n, n) =
K_X(0) < \infty$), then two interesting facts follow: First, if the input random
process is weakly stationary and if the processes are two-sided, then the
covariance of the output process depends only on the time lag; i.e., $K_Y(k, j)$
can be replaced by $K_Y(k - j)$. Note that this is not the case for a one-sided
process, even if the input process is stationary and the filter stable! This
fact, together with our earlier result regarding the mean, can be summarized
as follows:

Given a two-sided random process as input to a linear filter, if the input 
process is weakly stationary and the filter is stable, the output random pro- 
cess is also weakly stationary. The output mean and covariance functions 
are given by 



$$EY_n = m \sum_{k=0}^{\infty} h_k \tag{5.11}$$

$$K_Y(k) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} h_n h_m K_X(k - (n - m)). \tag{5.12}$$

The second observation is that (5.8), (5.9), (5.10) or (5.12) is a double 
discrete convolution! The direct evaluation of (5.8), (5.9), and (5.10) while 
straightforward in concept, can be an exceedingly involved computation in 
practice. As in other linear systems applications, the evaluations of convo- 
lutions can often be greatly simplified by resort to transform techniques, as 
shall be considered shortly. 
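
To see what the direct evaluation involves, the following Python sketch computes the stationary double convolution $K_Y(k) = \sum_n \sum_m h_n h_m K_X(k - (n - m))$ for a short FIR filter. The function name `stationary_output_covariance`, the two-tap filter, and the input covariance $0.5^{|k|}$ are assumptions made for illustration.

```python
def stationary_output_covariance(h, K_X, k):
    """K_Y(k) = sum_n sum_m h_n h_m K_X(k - (n - m)) for an FIR filter given as a list."""
    return sum(h[n] * h[m] * K_X(k - (n - m))
               for n in range(len(h)) for m in range(len(h)))


h = [1.0, 0.5]                          # hypothetical FIR delta response
K_X = lambda lag: 0.5 ** abs(lag)       # hypothetical weakly stationary input covariance

K_Y = {k: stationary_output_covariance(h, K_X, k) for k in range(-3, 4)}
print(K_Y)
```

Each lag costs a number of multiplications proportional to the square of the filter length, which is why transform methods become attractive for long filters.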

Continuous Time Systems 

For each of the discrete time filter results there is an analogous continuous 
time result. For simplicity, however, we consider only the simpler case 
of two-sided processes. Let {X{t)} be a two-sided continuous time input 
random process to a linear time-invariant filter with impulse response h{t). 

We can evaluate the mean and covariance functions of the output pro- 
cess in terms of the mean and covariance functions of the input random 
process by using the same development as was used for discrete random 
processes. This time we will have integrals instead of sums. Let m(t) 
and Kx{t, s) be the respective mean and covariance functions of the input 
process. Then the mean function of the output process is 

$$EY(t) = \int E[X(t - s)] h(s)\, ds = \int m(t - s) h(s)\, ds. \tag{5.13}$$

The covariance function of the output random process is obtained by computations
analogous to (5.7) as

$$K_Y(t, s) = \int d\alpha \int d\beta\, K_X(t - \alpha, s - \beta) h(\alpha) h(\beta). \tag{5.14}$$

Thus if $\{X(t)\}$ is weakly stationary with mean $m = m(t)$ and covariance
function $K_X(\tau)$, then

$$EY(t) = m \int h(t)\, dt \tag{5.15}$$

and

$$K_Y(t, s) = \int d\alpha \int d\beta\, K_X((t - s) - (\alpha - \beta)) h(\alpha) h(\beta). \tag{5.16}$$

In analogy to the discrete time result, the output mean is constant for a
two-sided random process, and the covariance function depends on only the
time difference. Thus a weakly stationary two-sided process into a stable
linear time-invariant filter yields a weakly stationary output process in both
discrete and continuous time. We leave it to the reader to develop conclusions
that are parallel to the discrete time results for one-sided processes.



Transform I/O Relations 

In both discrete and continuous time, the covariance function of the output 
can be found by first convolving the input autocorrelation with the pulse 
response hk or h{t) and then convolving the result with the reflected pulse 
response h-k or h(—t). A way of avoiding the double convolution is found 
in Fourier transforms. Taking the Fourier transform (continuous or dis- 
crete time) of the double convolution yields the transform of the covariance 
function, which can be used to arrive at the output covariance function — 
essentially the same result with (in many cases) less overall work. 

We shall show the development for discrete time; a similar sequence
of steps provides the proof for continuous time by replacing the sums by
integrals. Using (5.12),



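$$\begin{aligned}
\mathcal{F}_f(K_Y) &= \sum_k \left( \sum_n \sum_m h_n h_m K_X(k - (n - m)) \right) e^{-j2\pi fk} \\
&= \sum_n \sum_m h_n h_m \left( \sum_k K_X(k - (n - m)) e^{-j2\pi f(k - (n - m))} \right) e^{-j2\pi f(n - m)} \\
&= \left( \sum_n h_n e^{-j2\pi fn} \right) \left( \sum_m h_m e^{+j2\pi fm} \right) \mathcal{F}_f(K_X) \\
&= \mathcal{F}_f(K_X) \mathcal{F}_f(h) \mathcal{F}_f(h)^*,
\end{aligned} \tag{5.17}$$

where the asterisk denotes complex conjugate. If we define $H(f) = \mathcal{F}_f(h)$,
the transfer function of the filter, then the result can be abbreviated for
both continuous and discrete time as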



$$\mathcal{F}_f(K_Y) = |H(f)|^2 \mathcal{F}_f(K_X). \tag{5.18}$$

We can also conveniently describe the mean and autocorrelation functions
in the frequency domain. From (5.5) and (5.15) the mean $m_Y$ of the output
is related to the mean $m_X$ of the input simply as

$$m_Y = H(0) m_X. \tag{5.19}$$

Since $K_X(k) = R_X(k) - |m_X|^2$ and $K_Y(k) = R_Y(k) - |m_Y|^2$, (5.18) implies
that

$$\mathcal{F}_f(R_Y - |m_Y|^2) = |H(f)|^2 \mathcal{F}_f(R_X - |m_X|^2)$$

or

$$\begin{aligned}
\mathcal{F}_f(R_Y) - |m_Y|^2 \delta(f) &= |H(f)|^2 \left( \mathcal{F}_f(R_X) - |m_X|^2 \delta(f) \right) \\
&= |H(f)|^2 \mathcal{F}_f(R_X) - |H(f)|^2 |m_X|^2 \delta(f) \\
&= |H(f)|^2 \mathcal{F}_f(R_X) - |H(0)|^2 |m_X|^2 \delta(f),
\end{aligned}$$

where we have used the property of Dirac deltas that $g(f)\delta(f) = g(0)\delta(f)$
(provided $g(f)$ has no jumps at $f = 0$). Since $|m_Y|^2 = |H(0)|^2 |m_X|^2$ by (5.19),
the delta terms cancel. Thus the autocorrelation function
satisfies the same transform relation as the covariance function. This
result is abbreviated by giving a special notation to the transform of an 
autocorrelation function: Given a weakly stationary process {X(t)} with 
autocorrelation function Rx, the power spectral density of the process is 
defined by 



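$$S_X(f) = \mathcal{F}_f(R_X) = \begin{cases} \sum_k R_X(k) e^{-j2\pi fk} & \text{discrete time} \\ \int R_X(\tau) e^{-j2\pi f\tau}\, d\tau & \text{continuous time,} \end{cases} \tag{5.20}$$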



the Fourier transform of the autocorrelation function. The reason for the 
name will be given in the next section and discussed at further length later 
in the chapter. Given the definition we have now proved the following 
result. 

If a weakly stationary process {X(t)} with power spectral density Sx{f) 
is the input to a linear time invariant filter with transfer function H , then 
the output process {Y{t)} is also weakly stationary and has mean 



$$m_Y = H(0) m_X \tag{5.21}$$

and power spectral density

$$S_Y(f) = |H(f)|^2 S_X(f). \tag{5.22}$$



This result is true for both discrete and continuous time. 
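
The relation $S_Y(f) = |H(f)|^2 S_X(f)$ can be checked numerically in discrete time by computing the output autocorrelation through the double convolution and comparing transforms. The sketch below assumes a zero-mean input; the helper `dtft`, the two-tap filter, and the input autocorrelation values are hypothetical choices for illustration.

```python
import cmath
import math


def dtft(seq, f):
    """Sum_k seq[k] e^{-j 2 pi f k} for a dict mapping integer index k to a value."""
    return sum(v * cmath.exp(-2j * math.pi * f * k) for k, v in seq.items())


h = {0: 1.0, 1: 0.5}                    # hypothetical FIR delta response
R_X = {-1: 0.4, 0: 1.0, 1: 0.4}         # hypothetical zero-mean input autocorrelation

# Output autocorrelation from the double convolution:
# the input lag `lag` with filter indices (n, m) contributes to output lag lag + n - m.
R_Y = {}
for n, hn in h.items():
    for m, hm in h.items():
        for lag, r in R_X.items():
            k = lag + (n - m)
            R_Y[k] = R_Y.get(k, 0.0) + hn * hm * r

for f in (0.0, 0.1, 0.25, 0.5):
    S_X = dtft(R_X, f).real
    S_Y = dtft(R_Y, f).real
    H = dtft(h, f)
    print(f, S_Y, abs(H) ** 2 * S_X)    # the two columns should agree
```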



5.3 Power Spectral Densities 

Under suitable technical conditions the Fourier transform can be inverted to 
obtain the autocorrelation function from the power spectral density. Thus
the reader can verify from the definitions (5.20) that

$$R_X(\tau) = \begin{cases} \int_{-1/2}^{1/2} S_X(f) e^{j2\pi f\tau}\, df & \text{discrete time, integer } \tau \\ \int_{-\infty}^{\infty} S_X(f) e^{j2\pi f\tau}\, df & \text{continuous time, continuous } \tau. \end{cases} \tag{5.23}$$



The limits of —1/2 to +1/2 for the discrete time integral correspond to the 
fact that time is measured in units; e.g., adjacent outputs are one second 
or one minute or one year apart. Sometimes, however, the discrete time 
process is formed by sampling a continuous time process at every, say, T 
seconds, and it is desired to retain seconds as the unit of measurement. 
Then it is more convenient to incorporate the scale factor T into the time 
units and scale (5.20) and the limits of (5.23) accordingly — i.e., kT replaces 
$k$ in (5.20), and the limits become $-1/2T$ to $1/2T$.

Power spectral densities inherit the property of symmetry from autocorrelation
functions. As seen from the definition in chapter 4, covariance and
autocorrelation functions are symmetric ($R_X(t, s) = R_X(s, t)$). Therefore
$R_X(\tau)$ is an even function. From (5.20) it can be seen with a little juggling
that $S_X(f)$ is also even; that is, $S_X(-f) = S_X(f)$ for all $f$.

The reason for the name “power spectral density” comes from observing 
how the average power of a random process is distributed in the frequency 
domain. The autocorrelation function evaluated at 0 lag, $P_X = R_X(0) =
E(|X(t)|^2)$, can be interpreted as the average power dissipated in a unit
resistor by a voltage $X(t)$. Since the autocorrelation is the inverse Fourier
transform of the power spectral density, this means that

$$P_X = \int S_X(f)\, df, \tag{5.24}$$

that is, the total average power in the process can be found by integrating
$S_X(f)$. Thus if $S_X$ were nonnegative, it could be considered as a density of
power analogous to integrating a probability or mass density to find total 
probability or mass. For the probability and mass analogues, however, we 
know that integrating over any reasonable set will give the probability or 
mass of that set, i.e., we do not wish to confine interest to integrating over all 
possible frequencies. The analogous consideration for power is to look at the 
total average power within an arbitrary frequency band, which we do next. 
The fact that power spectral densities are nonnegative can be derived from 
the fact that the autocorrelation function is nonnegative definite (which 
can be shown in the same way it was shown for covariance functions) — 
a result known as Bochner’s theorem. We shall prove nonnegativity of the 
power spectral density as part of the development.

Suppose that we wish to find the power of a process, say $\{X_t\}$, in some
frequency band $f \in F$. Then a physically natural way to accomplish this
would be to pass the given process through a bandpass filter with transfer
function $H(f)$ equal to 1 for $f \in F$ and 0 otherwise and then to measure
the output power. This is depicted in Figure 5.1 for the special case of
a frequency interval $F = \{f : f_0 < |f| < f_0 + \Delta f\}$. Calling the output
process $\{Y_t\}$, we have from (5.24) that the output power is

$$R_Y(0) = \int S_Y(f)\, df = \int |H(f)|^2 S_X(f)\, df = \int_{f \in F} S_X(f)\, df. \tag{5.25}$$

Figure 5.1: Power spectral density. (The input $X_t$ is passed through the filter $H(f)$ to give $Y_t$, with $E[Y_t^2] = \int_{f:\, f \in F} S_X(f)\, df$.)

Thus to find the average power contained in any frequency band we integrate
the power spectral density over the frequency band. Because the
average power must be nonnegative for any choice of $f_0$ and $\Delta f$, it follows
that any power spectral density must be nonnegative, i.e.,

$$S_X(f) \ge 0, \quad \text{all } f. \tag{5.26}$$

To elaborate further, suppose that this is not true; i.e., suppose that $S_X(f)$
is negative for some range of frequencies. If we put $\{X_t\}$ through a filter
that passes only those frequencies, the filter output power would have to
be negative, clearly an impossibility.
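
The idea of integrating a spectral density over a band can be illustrated numerically. The following Python sketch assumes a hypothetical discrete time density $S_X(f) = 1 + 0.8\cos(2\pi f)$, which is the transform of the autocorrelation with $R_X(0) = 1$ and $R_X(\pm 1) = 0.4$; the midpoint-rule helper `integrate` and the band parameters are likewise illustrative assumptions.

```python
import math


def S_X(f):
    """Hypothetical PSD: transform of R_X(0) = 1, R_X(+1) = R_X(-1) = 0.4."""
    return 1.0 + 0.8 * math.cos(2 * math.pi * f)


def integrate(g, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    width = (b - a) / steps
    return width * sum(g(a + (i + 0.5) * width) for i in range(steps))


total_power = integrate(S_X, -0.5, 0.5)          # should be close to R_X(0) = 1
f0, df = 0.1, 0.05
band_power = 2 * integrate(S_X, f0, f0 + df)     # band with f0 < |f| < f0 + df, using evenness
print(total_power, band_power)
```

The factor of 2 accounts for the negative-frequency half of the band, which contributes equally because the density is even.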

From the foregoing considerations it can be deduced that the name 
power spectral density derives from the fact that Sx{f) is a nonnegative 
function that is integrated to get power; that is, a “spectral” (meaning fre- 
quency content) density of power. Keep in mind the analogy to evaluating 
probability by integrating a probability density.

5.4 Linearly Filtered Uncorrelated Processes

If the input process $\{X_n\}$ to a discrete time linear filter with $\delta$ response
$\{h_k\}$ is a weakly stationary uncorrelated process with mean $m$ and variance
$\sigma^2$ (for example, if it is iid), then $K_X(k) = \sigma^2 \delta_k$ and $R_X(k) = \sigma^2 \delta_k + m^2$.
In this case the power spectral density is easily found to be

$$S_X(f) = \sum_k \sigma^2 \delta_k e^{-j2\pi fk} + m^2 \delta(f) = \sigma^2 + m^2 \delta(f); \quad \text{all } f,$$

since the only nonzero term in the sum is the $k = 0$ term. The presence of
the Dirac delta is due to the nonzero mean. When the mean is zero, this
simplifies to

$$S_X(f) = \sigma^2, \quad \text{all } f. \tag{5.27}$$

Because the power spectral density is flat in this case, in analogy to the flat
electromagnetic spectrum of white light, such a process (a discrete time,
weakly stationary, zero mean, uncorrelated process) is said to be white
or white noise. The inverse Fourier transform of the white noise spectral
density is found from (5.23) (or simply by uniqueness) to be $R_X(k) = \sigma^2 \delta_k$.
Thus a discrete time random process is white if and only if it is weakly 
stationary, zero mean, and uncorrelated. 

For the two-sided case we have from (5.12) that the output covariance
is

$$K_Y(k) = \sigma^2 \sum_{n=0}^{\infty} h_n h_{n-k} = \sigma^2 \sum_{n=k}^{\infty} h_n h_{n-k}; \quad \mathcal{T} = \mathcal{Z}, \tag{5.28}$$

where the lower limit of the sum follows from the causality of the filter.
If we assume for simplicity that $m = 0$, the power spectral density in this
case reduces to

$$S_Y(f) = \sigma^2 |H(f)|^2. \tag{5.29}$$

For a one-sided process, (5.10) yields

$$K_Y(k, j) = \sigma^2 \sum_{n=0}^{k} h_n h_{n-(k-j)}; \quad \mathcal{T} = \mathcal{Z}_+. \tag{5.30}$$

Note that if $k > j$, then the sum can be taken over the limits $n = k - j$
to $k$ since causality of the filter implies that the first few terms are 0. If
$k < j$, then all of the terms in the sum may be needed. The covariance for
the one-sided case appears to be asymmetric, but recalling that $h_l$ is 0 for
negative $l$, we can write the terms of the sum of (5.30) in descending order
to obtain

$$\sigma^2 (h_k h_j + h_{k-1} h_{j-1} + \cdots + h_0 h_{j-k})$$

if $j \ge k$ and

$$\sigma^2 (h_k h_j + h_{k-1} h_{j-1} + \cdots + h_{k-j} h_0)$$

if $j < k$. By defining the function $\min(k, j)$ to be the smaller of $k$ and $j$,
we can rewrite (5.30) in two symmetric forms:

$$K_Y(k, j) = \sigma^2 \sum_{n=0}^{\min(k,j)} h_{k-n} h_{j-n}; \quad \mathcal{T} = \mathcal{Z}_+ \tag{5.31}$$

and

$$K_Y(k, j) = \sigma^2 \sum_{n=0}^{\min(k,j)} h_n h_{n+|k-j|}. \tag{5.32}$$

The one-sided process is not weakly stationary because of the distinct presence
of $k$ and $j$ in the sum, so the power spectral density is not defined.
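
The two symmetric forms for the one-sided covariance, $\sigma^2 \sum_{n=0}^{\min(k,j)} h_{k-n} h_{j-n}$ and $\sigma^2 \sum_{n=0}^{\min(k,j)} h_n h_{n+|k-j|}$, can be compared numerically, and the lack of stationarity shows up as different values at the same lag. In the Python sketch below the function names `cov_531` and `cov_532` and the causal response $h_n = 0.5^n$ are hypothetical choices for illustration.

```python
def cov_531(h, sigma2, k, j):
    """sigma^2 * sum_{n=0}^{min(k,j)} h_{k-n} h_{j-n}."""
    return sigma2 * sum(h(k - n) * h(j - n) for n in range(min(k, j) + 1))


def cov_532(h, sigma2, k, j):
    """sigma^2 * sum_{n=0}^{min(k,j)} h_n h_{n+|k-j|}."""
    return sigma2 * sum(h(n) * h(n + abs(k - j)) for n in range(min(k, j) + 1))


h = lambda n: 0.5 ** n if n >= 0 else 0.0   # hypothetical causal delta response

print(cov_531(h, 1.0, 2, 5), cov_532(h, 1.0, 2, 5))   # the two forms agree
print(cov_531(h, 1.0, 0, 0), cov_531(h, 1.0, 3, 3))   # same lag 0, different values
```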

In the two-sided case, the expression (5.28) for the output covariance
function is the convolution of the unit pulse response with its reflection $h_{-k}$.
Such a convolution between a waveform or sequence and its own reflection
is also called a sample autocorrelation.

We next consider specific examples of this computation. These examples 
point out how two processes — one one-sided and the other two sided — 
can be apparently similar and yet have quite different properties. 

[5.1] Suppose that an uncorrelated discrete time two-sided random process
$\{X_n\}$ with mean $m$ and variance $\sigma^2$ is put into a linear filter with
causal pulse response $h_k = r^k$, $k \ge 0$, with $|r| < 1$. Let $\{Y_n\}$ denote
the output process, i.e.,

$$Y_n = \sum_{k=0}^{\infty} r^k X_{n-k}. \tag{5.33}$$

Find the output mean and covariance. 

From the geometric series summation formula,

$$\sum_{k=0}^{\infty} |r^k| = \frac{1}{1-|r|},$$

and hence the filter is stable. From (5.4), (5.5), and (5.6)

$$EY_n = m \sum_{k=0}^{\infty} r^k = \frac{m}{1-r}; \quad n \in \mathcal{Z}.$$



From (5.28), the output covariance for nonnegative $k$ is

$$K_Y(k) = \sigma^2 \sum_{n=k}^{\infty} r^n r^{n-k} = \sigma^2 r^{-k} \sum_{n=k}^{\infty} (r^2)^n = \sigma^2 \frac{r^k}{1-r^2}$$

using the geometric series formula. Repeating the development for
negative $k$ (or appealing to symmetry) we find in general the covariance
function is

$$K_Y(k) = \sigma^2 \frac{r^{|k|}}{1-r^2}; \quad k \in \mathcal{Z}.$$

Observe in particular that the output variance is

$$\sigma_Y^2 = K_Y(0) = \frac{\sigma^2}{1-r^2}.$$

As $|r| \to 1$ the output variance grows without bound. However, as
long as |r| < 1, the variance is defined and the process is clearly 
weakly stationary. 
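
As a numerical sanity check on example [5.1], the sketch below compares a truncated version of the sum in (5.28) with the closed form $\sigma^2 r^{|k|}/(1-r^2)$. The function names and the truncation length `terms` are assumptions made only for this illustration.

```python
def cov_by_sum(r, k, sigma2=1.0, terms=2000):
    """Truncated (5.28) for h_n = r**n, n >= 0 (the pulse response is 0 for n < 0)."""
    start = max(0, k)  # causality: h_n = 0 for n < 0 and h_{n-k} = 0 for n < k
    return sigma2 * sum(r**n * r**(n - k) for n in range(start, start + terms))

def cov_closed_form(r, k, sigma2=1.0):
    """Closed form of example [5.1]."""
    return sigma2 * r**abs(k) / (1.0 - r * r)

if __name__ == "__main__":
    for k in (-3, 0, 3):
        print(k, cov_by_sum(0.8, k), cov_closed_form(0.8, k))
```

For $|r| < 1$ the two values should agree to within the truncation error, for negative as well as nonnegative lags.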



The previous example has an alternative construction that demon- 
strates how two models that appear quite different can lead to the 
same thing. From (5.33) we have 

$$\begin{aligned}
Y_n - rY_{n-1} &= \sum_{k=0}^{\infty} r^k X_{n-k} - r \sum_{k=0}^{\infty} r^k X_{n-1-k} \\
&= X_n + \sum_{k=1}^{\infty} r^k X_{n-k} - r \sum_{k=0}^{\infty} r^k X_{n-1-k} \\
&= X_n,
\end{aligned}$$

since the two sums are equal. This yields a difference equation relating
the two processes, expressing the output process $Y_n$ in a recursive
form:

$$Y_n = X_n + rY_{n-1}. \tag{5.34}$$

Thus the new $Y_n$ is formed by adding the new $X_n$ to $r$ times the previous
output $Y_{n-1}$. This representation shows that in a sense the $X_n$ process
represents the “new information” in the $Y_n$ process. We will see in the next
chapter that if $X_n$ is actually iid and not just uncorrelated, this
representation leads to a complete probabilistic description of the output
process. The representation (5.34) is called a first-order autoregressive 
model for the process, in contrast to the ordinary convolution repre- 
sentation of (5.33), which is often called a moving average model. 
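
The agreement between the convolution form (5.33) and the recursion (5.34) can be illustrated on a finite record. The sketch below assumes, purely so that the computation is finite, that the input starts at time 0 and that the output is zero before time 0 (the one-sided setting); the function names are hypothetical.

```python
import random

def filter_by_convolution(x, r):
    """Y_n = sum_{k=0}^{n} r**k * x[n-k], the convolution form for an input starting at 0."""
    return [sum(r**k * x[n - k] for k in range(n + 1)) for n in range(len(x))]

def filter_by_recursion(x, r):
    """Y_n = x[n] + r * Y_{n-1} with the assumed initial condition Y_{-1} = 0."""
    y, previous = [], 0.0
    for sample in x:
        previous = sample + r * previous
        y.append(previous)
    return y

if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(10)]
    print(filter_by_convolution(x, 0.7))
    print(filter_by_recursion(x, 0.7))
```

The recursion needs one multiplication and one addition per output sample, while the direct convolution grows with $n$, which is one practical reason the autoregressive form is attractive.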

The output spectral density can be found directly by taking the
Fourier transform of the output covariance as

$$S_Y(f) = \sum_{k=-\infty}^{\infty} \frac{\sigma^2 r^{|k|}}{1-r^2} e^{-j2\pi fk},$$

a summation that can be evaluated using the geometric series formula,
first from 1 to $\infty$ and then from 0 to $-\infty$, and then summing
the two complex terms. The reader should perform this calculation
as an exercise. It is easier, however, to find the output spectral
density through the linear system I/O relation. The transfer function
of the filter is evaluated by a single application of the geometric series
formula as

$$H(f) = \sum_{k=0}^{\infty} r^k e^{-j2\pi fk} = \frac{1}{1 - re^{-j2\pi f}}.$$

Therefore the output spectral density from (5.22) is

$$S_Y(f) = \frac{\sigma^2}{|1 - re^{-j2\pi f}|^2} = \frac{\sigma^2}{1 + r^2 - 2r\cos(2\pi f)}.$$

By a quick table lookup the reader can verify that the inverse transform
of the output spectral density agrees with the covariance function
previously found.
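
The two routes to the spectral density of example [5.1] can be compared numerically. This is a rough sketch, assuming the covariance $\sigma^2 r^{|k|}/(1-r^2)$ and truncating the Fourier sum at $|k| \le K$; the function names and the truncation parameter `K` are illustrative choices.

```python
import cmath
import math

def psd_closed_form(f, r, sigma2=1.0):
    """sigma^2 / (1 + r^2 - 2 r cos(2 pi f)), obtained from sigma^2 |H(f)|^2."""
    return sigma2 / (1.0 + r * r - 2.0 * r * math.cos(2.0 * math.pi * f))

def psd_from_covariance(f, r, sigma2=1.0, K=200):
    """Truncated Fourier sum of K_Y(k) = sigma^2 r^{|k|} / (1 - r^2) over |k| <= K."""
    total = 0j
    for k in range(-K, K + 1):
        total += sigma2 * r**abs(k) / (1.0 - r * r) * cmath.exp(-2j * math.pi * f * k)
    return total.real

if __name__ == "__main__":
    for f in (0.0, 0.1, 0.25, 0.5):
        print(f, psd_closed_form(f, 0.6), psd_from_covariance(f, 0.6))
```

Because $r^{|k|}$ decays geometrically, a modest truncation already reproduces the closed form to many digits.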



[5.2] Suppose that a one-sided uncorrelated process $\{X_n\}$ with mean $m$
and variance $\sigma^2$ is put into a one-sided filter with pulse response as
in example [5.1]. Let $\{Y_n\}$ be the resulting one-sided output process.
Find the output mean and covariance.

This time (5.6) yields

$$EY_n = m \sum_{k=0}^{n} r^k = m\,\frac{1 - r^{n+1}}{1-r}$$

from the geometric series formula. From (5.32) the covariance is

$$K_Y(k,j) = \sigma^2 \sum_{n=0}^{\min(k,j)} r^{2n+|k-j|} = \sigma^2 r^{|k-j|}\,\frac{1 - r^{2(\min(k,j)+1)}}{1-r^2}.$$


Observe that since $|r| < 1$, if we let $n \to \infty$, then the mean of this
example goes to the mean of the preceding example in the limit. Similarly,
if one fixes the lag $|k - j|$ and lets $k$ (and hence $j$) go to $\infty$, then
in the limit the one-sided covariance looks like the two-sided example.
This simple example points out a typical form of non-stationarity: A 
linearly filtered uncorrelated process is not stationary by any defini- 
tion, but as one gets farther and farther from the origin, the param- 
eters look more and more stationary. This can be considered as a 
form of asymptotic stationarity. In fact, a process is defined as being 
asymptotically weakly stationary if the mean and covariance converge 
in the sense just given. One can view such processes as having tran- 
sients that die out with time. It is not difficult to show that if a 
process is asymptotically weakly stationary and if the limiting mean 
and covariance meet the conditions of the ergodic theorem, then the 
process itself will satisfy the ergodic theorem. Intuitively stated, tran- 
sients do not affect the long-term sample averages. 
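
The asymptotic stationarity described in example [5.2] can be seen by evaluating the one-sided mean and covariance at increasing times with the lag held fixed. The sketch below uses the closed forms for $h_k = r^k$; the function names are assumptions made for illustration.

```python
def mean_one_sided(r, n, m=1.0):
    """EY_n = m (1 - r^{n+1}) / (1 - r) for the one-sided example."""
    return m * (1.0 - r**(n + 1)) / (1.0 - r)

def cov_one_sided(r, k, j, sigma2=1.0):
    """K_Y(k,j) = sigma^2 r^{|k-j|} (1 - r^{2(min(k,j)+1)}) / (1 - r^2)."""
    return sigma2 * r**abs(k - j) * (1.0 - r**(2 * (min(k, j) + 1))) / (1.0 - r * r)

def cov_two_sided(r, lag, sigma2=1.0):
    """Stationary covariance sigma^2 r^{|lag|} / (1 - r^2) of example [5.1]."""
    return sigma2 * r**abs(lag) / (1.0 - r * r)

if __name__ == "__main__":
    r, lag = 0.8, 2
    for j in (0, 5, 20, 80):
        # the transient term r^{2(j+1)} shrinks as the time origin recedes
        print(j, cov_one_sided(r, j + lag, j), cov_two_sided(r, lag))
```

The gap between the one-sided and two-sided values is the transient $\sigma^2 r^{|k-j|} r^{2(\min(k,j)+1)}/(1-r^2)$, which decays geometrically in $\min(k,j)$.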



[5.3] Next consider the one-sided process of example [5.2], but now choose
the pulse response with $r = 1$; that is, $h_k = 1$ for all $k \ge 0$. Find
the output mean and covariance. (Note that this filter is not stable.)
Applying (5.4)-(5.6) and (5.28), (5.30), and (5.31) yields

$$EY_n = m \sum_{k=0}^{n} h_k = m(n+1)$$

and

$$K_Y(k,j) = \sigma^2(\min(k,j) + 1) = \sigma^2 \min(k+1, j+1).$$

Observe that like example [5.2], the process of example [5.3] is not 
weakly stationary. Unlike [5.2], however, it does not behave asymptoti- 
cally like a weakly stationary process — even for large time, the moments 
very much depend on the time origin. Thus the non-stationarities of this 
process are not only transients — they last forever! In a sense, this process 
is much more non-stationary than the previous one and, in fact, does not 
have a mean ergodic theorem. If the input process is Gaussian with zero
mean, then we shall see in chapter 6 that the output process $\{Y_n\}$ is also
Gaussian. Such a Gaussian process with zero mean and with the covariance 
function of this example is called the discrete time Wiener process. 
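
A small Monte Carlo experiment gives a feel for the covariance $\sigma^2 \min(k+1, j+1)$ of example [5.3]. The sketch assumes a zero-mean Gaussian iid input with $\sigma^2 = 1$; the function names, the number of trials, and the seed are arbitrary choices for illustration, and the estimate is only approximate.

```python
import random

def simulate_sum_process(n_steps, rng):
    """One sample path Y_0..Y_{n_steps-1} with Y_n = X_0 + ... + X_n, X_i iid N(0, 1)."""
    path, total = [], 0.0
    for _ in range(n_steps):
        total += rng.gauss(0.0, 1.0)
        path.append(total)
    return path

def estimate_covariance(k, j, trials=20000, seed=0):
    """Average of Y_k * Y_j over independent paths (the mean is zero here)."""
    rng = random.Random(seed)
    n_steps = max(k, j) + 1
    acc = 0.0
    for _ in range(trials):
        path = simulate_sum_process(n_steps, rng)
        acc += path[k] * path[j]
    return acc / trials

if __name__ == "__main__":
    print(estimate_covariance(9, 29), "should be near", min(9, 29) + 1)
```

Unlike example [5.2], the estimate depends on $\min(k,j)$ itself no matter how large the times are, so there is no stationary limit.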

[5.4] A Binary Markov Process

The linear filtering ideas can be applied when other forms of
arithmetic than real arithmetic are used. Rather than try to be general
we illustrate the approach by an example, a process formed by linearly
filtering, using binary (modulo 2) arithmetic, an iid sequence of coin
flips.

Given a known input process and a filter (a modulo 2 linear recursion
in the present case), find the covariance function of the output.
General formulas will be derived later in the book; here a direct approach
to the problem at hand is taken.

First observe that $K_Y(k,j) = E[(Y_k - E(Y_k))(Y_j - E(Y_j))]$ is easily
evaluated for the case $k = j$ because the marginal for $Y_k$ is equiprobable:

$$E[Y_k] = \sum_y y\,p_Y(y) = \frac{1}{2}(0 + 1) = \frac{1}{2}$$

$$K_Y(k,k) = \sigma_Y^2 = E[(Y_k - \tfrac{1}{2})^2] = \frac{1}{4} \sum_y p_Y(y) = \frac{1}{4}.$$



Next observe that a covariance function is symmetric in the sense 
that 



$$\begin{aligned}
K_Y(k,j) &= E[(Y_k - E(Y_k))(Y_j - E(Y_j))] \\
&= E[(Y_j - E(Y_j))(Y_k - E(Y_k))] \\
&= K_Y(j,k)
\end{aligned}$$



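so that we will be done if we evaluate $K_Y(k,j)$ for the special case
where $k = j + l$ for $l \ge 1$. Consider therefore

$$\begin{aligned}
K_Y(j+l,j) &= E[(Y_{j+l} - \tfrac{1}{2})(Y_j - \tfrac{1}{2})] \\
&= \sum_{x,y,z} (x \oplus y - \tfrac{1}{2})(z - \tfrac{1}{2})\, p_{X_{j+l},Y_{j+l-1},Y_j}(x,y,z) \\
&= \sum_{x,y,z} (x \oplus y - \tfrac{1}{2})(z - \tfrac{1}{2})\, p_{X_{j+l}}(x)\, p_{Y_{j+l-1},Y_j}(y,z) \\
&= \sum_{y,z} \Big[ (0 \oplus y - \tfrac{1}{2})(z - \tfrac{1}{2})(1-p)\, p_{Y_{j+l-1},Y_j}(y,z) \\
&\qquad\quad + (1 \oplus y - \tfrac{1}{2})(z - \tfrac{1}{2})\, p\, p_{Y_{j+l-1},Y_j}(y,z) \Big].
\end{aligned}$$

Since $0 \oplus y = y$ and $1 \oplus y = 1 - y$, this becomes

$$\begin{aligned}
K_Y(j+l,j) &= (1-p) \sum_{y,z} (y - \tfrac{1}{2})(z - \tfrac{1}{2})\, p_{Y_{j+l-1},Y_j}(y,z) \\
&\quad - p \sum_{y,z} (y - \tfrac{1}{2})(z - \tfrac{1}{2})\, p_{Y_{j+l-1},Y_j}(y,z) \\
&= (1-2p)\, K_Y(j+l-1,j); \quad l = 1,2,\ldots
\end{aligned}$$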

This is a simple linear difference equation with initial condition $K_Y(j,j)$
and hence the solution is

$$K_Y(j+l,j) = (1-2p)^l K_Y(j,j) = \frac{1}{4}(1-2p)^l; \quad l = 1,2,\ldots \tag{5.35}$$

(Just plug it into the difference equation to verify that it is indeed a
solution.) Invoking the symmetry property the covariance function is
given by

$$K_Y(k,j) = \frac{1}{4}(1-2p)^{|k-j|} = K_Y(k-j). \tag{5.36}$$

Note that $K_Y(k)$ is absolutely summable (use the geometric progression)
so that the weak law of large numbers holds for the process.
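
The solution (5.35) can be checked by propagating the exact joint distribution of the pair of outputs instead of solving the difference equation. The sketch below assumes the modulo 2 recursion $Y_n = X_n \oplus Y_{n-1}$ with iid inputs having $P(X_n = 1) = p$ and an equiprobable $Y_j$; these modeling details and the function name are assumptions stated here for illustration.

```python
def binary_markov_cov(p, lag):
    """Exact covariance of (Y_{j+lag}, Y_j) for Y_n = X_n XOR Y_{n-1}, P(X_n = 1) = p."""
    # joint[(y, z)] = P(Y_{j+l} = y, Y_j = z); at l = 0 the two variables coincide
    joint = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
    for _ in range(lag):
        updated = {key: 0.0 for key in joint}
        for (y, z), prob in joint.items():
            updated[(y, z)] += (1.0 - p) * prob   # X = 0 leaves the state unchanged
            updated[(1 - y, z)] += p * prob       # X = 1 flips the state
        joint = updated
    return sum((y - 0.5) * (z - 0.5) * prob for (y, z), prob in joint.items())

if __name__ == "__main__":
    p = 0.2
    for lag in range(5):
        print(lag, binary_markov_cov(p, lag), 0.25 * (1 - 2 * p) ** lag)
```

For $p = 1/2$ the covariance vanishes at every nonzero lag, while for $p$ near 0 or 1 it decays slowly (with alternating sign when $p > 1/2$).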






5.5 Linear Modulation 

In this section we consider a different form of linear system: a linear
modulator. Unlike the filters considered thus far, these systems are generally
time-varying and contain random parameters. They are simpler than the
general linear filters, however, in that the output depends on the input in 
an instantaneous fashion; that is, the output at time t depends only on the 
input at time t and not on previous inputs. 

In general, the word modulation means the methodical altering of one 
waveform by another. The waveform being altered is often called a carrier, 
and the waveform or sequence doing the altering, which we will model as a
random process, is called the signal. Physically, such modulation is usually
done to transform an information-bearing signal into a process suitable for 
communication over a particular medium; e.g., simple amplitude modula- 
tion of a carrier sinusoid by a signal in order to take advantage of the fact 
that the resulting high-frequency signals will better propagate through the 
atmosphere than will audio frequencies. 

The emphasis will be on continuous time random processes since most 
communication systems involve at some point such a continuous time link. 
Several of the techniques, however, work virtually without change in a 
discrete environment. 

The prime example of linear modulation is the ubiquitous amplitude
modulation or AM used for much of commercial broadcasting. If $\{X(t)\}$ is
a continuous time weakly stationary random process with zero mean and
covariance function $K_X(\tau)$, then the output process

$$Y(t) = (a_0 + a_1 X(t)) \cos(2\pi f_0 t + \theta) \tag{5.37}$$

is called amplitude modulation of the cosine by the original process. The
parameters $a_0$ and $a_1$ are called modulation constants. Observe that linear
modulation is not a linear operation in the normal linear systems sense
unless the constant $a_0$ is 0. (It is, however, an affine operation: linear in
the sense that straight lines in the two-dimensional $x$-$y$ space are said to
be linear.) Nonetheless, as is commonly done, we will refer to this operation
as linear modulation.

The phase term $\theta$ may be a fixed constant or a random variable, say
$\Theta$. (We point out a subtle source of confusion here: If $\Theta$ is a random
variable, then the system is affine or linear for the input process only when
the actual sample value, say $\theta$, of $\Theta$ is known.) We usually assume for
convenience that $\Theta$ is a random variable, independent of the $X$ process
and uniformly distributed on $[0, 2\pi]$, one complete rotation of the carrier
phasor in the complex plane. This is a mathematical convenience that,
as we will see, makes $Y(t)$ weakly stationary. Physically it corresponds to
the simple notion that we are modeling the modulated waveform as seen
by a receiver. Such a receiver will not know a priori the phase of the
transmitter oscillator producing the sinusoid. Furthermore, although the
transmitted phase could be monitored and related to the signal as part of
the transmission process, this is never done with AM. Hence, so far as the
receiver is concerned, the phase is equally likely to be anything; that is, it
has a uniform distribution independent of the signal.

If $a_0 = 0$, the modulated process is called double sideband suppressed
carrier (DSB or DSB-SC). The $a_0$ term clearly wastes power, but it makes
recovery or demodulation of the original process easier and cheaper, as
explained in any text on elementary communication theory. Our goal here
is only to look at the second-order properties of the AM process.

Observe that for any fixed phase angle, say $\theta = 0$ for convenience, a
system taking a waveform and producing the DSB modulated waveform is 
indeed linear in the usual linear systems sense. It is actually simpler than 
the output of a general linear filter since the output at a given time depends 
only on the input at that time. 

Since $\Theta$ and the $X$ process are independent, we have that the mean of
the output is

$$EY(t) = (a_0 + a_1 EX(t))\,E\cos(2\pi f_0 t + \Theta).$$

But $\Theta$ is uniformly distributed. Thus for any fixed time and frequency,

$$E\cos(2\pi f_0 t + \Theta) = \int_0^{2\pi} \cos(2\pi f_0 t + \theta)\,\frac{d\theta}{2\pi} = \frac{1}{2\pi} \int_0^{2\pi} \cos(2\pi f_0 t + \theta)\,d\theta = 0 \tag{5.38}$$

since the integral of a sinusoid over a period is zero; hence $EY(t) = 0$
whether or not the original signal has zero mean.

The covariance function of the output is given by the following expansion
of the product $Y(t)Y(s)$ using (5.37):

$$\begin{aligned}
K_Y(t,s) ={}& a_0^2\,E[\cos(2\pi f_0 t + \Theta)\cos(2\pi f_0 s + \Theta)] \\
&+ a_0 a_1 \big(EX(t)\,E[\cos(2\pi f_0 t + \Theta)\cos(2\pi f_0 s + \Theta)] \\
&\qquad\quad + EX(s)\,E[\cos(2\pi f_0 t + \Theta)\cos(2\pi f_0 s + \Theta)]\big) \\
&+ a_1^2 K_X(t,s)\,E[\cos(2\pi f_0 t + \Theta)\cos(2\pi f_0 s + \Theta)].
\end{aligned}$$

Using the fact that the original process has zero mean eliminates the middle
lines in the preceding. Combining the remaining two terms and using the
cosine identity

$$\cos x \cos y = \frac{1}{2}\cos(x+y) + \frac{1}{2}\cos(x-y)$$

yields

$$K_Y(t,s) = (a_0^2 + a_1^2 K_X(t,s)) \times \left( \frac{1}{2} E\cos(2\pi f_0 (t+s) + 2\Theta) + \frac{1}{2} E\cos(2\pi f_0 (t-s)) \right).$$

Exactly as in the mean computation of (5.38), the expectation of the term
with the $\Theta$ in it is zero, leaving

$$K_Y(\tau) = \frac{1}{2}(a_0^2 + a_1^2 K_X(\tau)) \cos(2\pi f_0 \tau).$$

Thus we have demonstrated that amplitude modulation of a carrier by 
a weakly stationary random process results in an output that is weakly 
stationary. 
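
The key step in the AM covariance computation is that averaging the product of the two cosines over a uniform phase leaves $\frac{1}{2}\cos(2\pi f_0 (t - s))$. The sketch below checks this by numerical averaging over the phase and then assembles the output covariance; the function names, the midpoint averaging rule, and the example input covariance are assumptions for illustration.

```python
import math

def phase_average(t, s, f0, steps=1000):
    """Average of cos(2 pi f0 t + theta) cos(2 pi f0 s + theta) over theta uniform on [0, 2 pi)."""
    total = 0.0
    for i in range(steps):
        theta = 2.0 * math.pi * (i + 0.5) / steps
        total += math.cos(2 * math.pi * f0 * t + theta) * math.cos(2 * math.pi * f0 * s + theta)
    return total / steps

def am_covariance(tau, a0, a1, f0, input_cov):
    """K_Y(tau) = (1/2)(a0^2 + a1^2 K_X(tau)) cos(2 pi f0 tau), with K_X passed as a function."""
    return 0.5 * (a0**2 + a1**2 * input_cov(tau)) * math.cos(2 * math.pi * f0 * tau)

if __name__ == "__main__":
    f0 = 3.0
    print(phase_average(0.7, 0.2, f0), 0.5 * math.cos(2 * math.pi * f0 * 0.5))
    # hypothetical input covariance exp(-|tau|), used only as an example
    print(am_covariance(0.5, 1.0, 2.0, f0, lambda tau: math.exp(-abs(tau))))
```

The phase average depends on $t$ and $s$ only through $t - s$, which is the reason the randomized phase makes the modulated process weakly stationary.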

The power spectral density of the AM process that we considered in 
the section on linear modulation can be found directly by transforming the 
covariance function or by using standard Fourier techniques: The transform 
of a covariance function times a cosine is the convolution of the original 
power spectral density with the generalized Fourier transform of the cosine 
— that is, a pair of impulses. This yields a pair of replicas of the original 
power spectral densities, centered at plus and minus the carrier frequency 
$f_0$ and symmetric about $f_0$, as depicted in Figure 5.2.

If further filtering is desired, e.g., to remove one of the symmetric halves 
of the power spectral density to form single sideband modulation, then the 
usual linear techniques can be applied, as indicated by (5.22). 



5.6 White Noise 

Let $\{X_n\}$ be an uncorrelated weakly stationary discrete time random
process with zero mean. We have seen that for such a process the covariance
function is a pulse at the origin; that is,

$$K_X(k) = \sigma_X^2 \delta_k,$$

where $\delta_k$ is a Kronecker delta function. As noted earlier, taking the Fourier
transform results in the spectral density

$$S_X(f) = \sigma_X^2; \quad \text{all } f,$$

that is, the power spectral density of such a process is flat over the entire 
frequency range. We remarked that a process with such a flat spectrum is 
said to be white. We now make this definition formally for both discrete 
and continuous time processes: 




[Figure 5.2: AM power spectral density. The modulator forms $Y(t) = (a_0 + a_1 X(t)) \cos(2\pi f_0 t + \Theta)$ from the input $X(t)$; the panels show $S_X(f)$, with frequency axis marks at $-W$, 0, and $W$, and $S_Y(f)$.]



A random process $\{X_t\}$ is said to be white if its power spectral density
is a constant for all $f$. (A white process is also almost always assumed to
have a zero mean, an assumption that we will make.) 

The concept of white noise is clearly well defined and free of analytical 
difficulties in the discrete time case. In the continuous time case, however, 
there is a problem if white noise is defined as a process with constant power
spectral density at all frequencies. Recall from (5.24) that the average power
in a process is the integral of the power spectral density. In the discrete
time case, integrating a constant over a finite range causes no problem. In
the continuous time case, we find from (5.24) that a white noise process has
infinite average power. In other words, if such a process existed, it would
blow up the universe! A quick perusal of the stochastic systems literature
shows, however, that this problem has not prevented continuous time white
noise process models from being popular and useful. The resolution of
the apparent paradox is fairly simple: Indeed, white noise is a physically
impossible process. But there do exist noise sources that have a flat power
spectral density over a range of frequencies that is much larger than the
bandwidths of subsequent filters or measurement devices. In fact, this is
exactly the case with the thermal noise process caused by heat in resistors in 
amplifier circuits. A derivation based on the physics of such a process (see
chapter 6) yields a covariance function of the form $K_X(\tau) = kTR\alpha e^{-\alpha|\tau|}$,
where $k$ is Boltzmann’s constant, $T$ is the absolute temperature, and $R$ and
$\alpha$ are parameters of the physical medium. The application of (5.20) results
in the power spectral density

$$S_X(f) = kTR\,\frac{2\alpha^2}{\alpha^2 + (2\pi f)^2}.$$

As $\alpha \to \infty$, the power spectral density tends toward the value $2kTR$ for all
$f$; that is, the process looks like white noise over a large bandwidth. Thus,
for example, the total noise power in a bandwidth $(-B, B)$ is approximately
$2kTR \times 2B$, a fact that has been verified closely by experiment.
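
The flattening of the thermal noise spectrum as $\alpha$ grows can be illustrated numerically. The sketch below works in normalized units with $kTR = 1$ (an assumption made only to keep the numbers simple) and uses the closed-form integral of the spectral density over $(-B, B)$; the function names are hypothetical.

```python
import math

def thermal_psd(f, alpha, kTR=1.0):
    """S_X(f) = kTR * 2 alpha^2 / (alpha^2 + (2 pi f)^2)."""
    return kTR * 2.0 * alpha**2 / (alpha**2 + (2.0 * math.pi * f) ** 2)

def band_power(B, alpha, kTR=1.0):
    """Integral of S_X over (-B, B): kTR * (2 alpha / pi) * atan(2 pi B / alpha)."""
    return kTR * (2.0 * alpha / math.pi) * math.atan(2.0 * math.pi * B / alpha)

if __name__ == "__main__":
    B = 1.0
    for alpha in (1.0, 1e2, 1e4, 1e6):
        # compare against the white-noise approximation 2 kTR * 2B = 4 in these units
        print(alpha, thermal_psd(B, alpha), band_power(B, alpha))
```

For fixed $B$ the band power approaches $2kTR \times 2B$ as $\alpha$ grows, while for fixed $\alpha$ the total power over all frequencies stays finite and equals $K_X(0) = kTR\alpha$.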

If such a process is put into a filter having a transfer function whose
magnitude becomes negligible long before the power spectral density of the
input process decreases much, then the output process power spectral
density $S_Y(f) = |H(f)|^2 S_X(f)$ will be approximately the same at the output
as it would have been if $S_X(f)$ were flat forever since $S_X(f)$ is flat for all
values of $f$ where $|H(f)|$ is non negligible. Thus, so far as the output
cess is concerned the input process can be either the physically impossible 
white noise model or a more realistic model with finite power. However, 
since the input white noise model is much simpler to work with analytically, 
it is usually adopted. 

In summary, continuous time white noise is often a useful model for the
input to a filter when we are trying to study the output. Commonly the
input random process is represented as being white with flat spectral density
equal to $N_0/2$. The factor of 2 is included because of the “two-sided”
nature of filter transfer functions; viz. a low pass filter with cutoff frequency
$B$ applied to the white noise input will have output power equal to $N_0 B$
in accordance with (5.25). Such a white noise process makes mathematical
sense, however, only if seen through a filter. The process itself is not
rigorously defined. Its covariance function, however, can be represented in terms
of a Dirac delta function for the purposes of analytical manipulations. Note
that in (5.23) the generalized Fourier transform of the flat spectrum results
in a Dirac delta function or unit impulse. In particular, if the continuous
time white noise random process has power spectral density

$$S_X(f) = \frac{N_0}{2},$$

then it will have a covariance or autocorrelation function

$$K_X(\tau) = \frac{N_0}{2}\,\delta(\tau).$$

Thus adjacent samples of the random process are uncorrelated (and hence 
also independent if the process is Gaussian) no matter how close together 
in time the samples are! At the same time, the variance of a single sample 
is infinite. Clearly such behavior is physically impossible. It is reasonable, 
however, to state qualitatively that adjacent samples are uncorrelated at 
all times greater than the shortest time delays in subsequent filtering. 

Perhaps the nicest attribute of white noise processes is the simple form 
of the output power spectral density of a linear filter driven by white noise. 
If a discrete or continuous time random process has power spectral density
$S_X(f) = N_0/2$ for all $f$ and it is put into a linear filter with transfer function
$H(f)$, then from (5.22) the output process $\{Y_t\}$ has power spectral density

$$S_Y(f) = |H(f)|^2 \frac{N_0}{2}. \tag{5.40}$$
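
For a discrete time example of (5.40), the output power obtained by integrating $|H(f)|^2$ times the flat input level over one period of frequency should equal that level times the sum of the squared pulse response values. The sketch below assumes a short real FIR pulse response and a midpoint integration rule; the function names and the parameter `level` (playing the role of $N_0/2$) are illustrative assumptions.

```python
import cmath
import math

def transfer_function(h, f):
    """H(f) = sum_k h_k exp(-j 2 pi f k) for a finite causal pulse response."""
    return sum(hk * cmath.exp(-2j * math.pi * f * k) for k, hk in enumerate(h))

def output_psd(h, f, level):
    """Output spectral density |H(f)|^2 * level for a white input of flat density `level`."""
    return abs(transfer_function(h, f)) ** 2 * level

def output_power(h, level, steps=512):
    """Midpoint-rule integral of the output spectral density over [-1/2, 1/2]."""
    return sum(output_psd(h, -0.5 + (i + 0.5) / steps, level) for i in range(steps)) / steps

if __name__ == "__main__":
    h = [1.0, 0.5, -0.25]
    level = 2.0
    print(output_power(h, level), level * sum(hk * hk for hk in h))
```

The agreement of the two printed numbers is Parseval's relation applied to the pulse response; the shape of the output spectrum is set entirely by the filter.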

The result given in (5.40) is of more importance than first appearances 
indicate. A basic result of the theory of weakly stationary random
processes, called the spectral factorization theorem, states that if a random
process $\{Y_t\}$ has a spectral density $S_Y(f)$ such that

$$\int_{-1/2}^{1/2} \ln S_Y(f)\,df > -\infty \quad \text{(discrete time)} \tag{5.41}$$

or

$$\int_{-\infty}^{\infty} \frac{\ln S_Y(f)}{1 + f^2}\,df > -\infty \quad \text{(continuous time)}, \tag{5.42}$$



then the power spectral density has the form of (5.40) for some causal linear 
stable time-invariant filter. That is, the second-order properties of any 
random process satisfying these conditions can be modeled as the output 
of a causal linear filter driven by white noise. Such random processes are 
said to be physically realizable and comprise most random processes seen 
in practice. The conditions (5.41-5.42) are referred to as the Paley-Wiener 
criteria[57]. This result is of extreme importance in estimation, detection, 
prediction, and system identification. We note in passing that in such 
models the white noise driving process is called the innovations process of 
the output process if the filter has a causal and stable inverse. 

As a word of caution, there do exist processes which are not “physically
realizable” in the above sense of violating the Paley-Wiener criterion
(5.41)-(5.42), yet which are still “physically realizable” in the sense that simple
models describe the processes. Consider the following example suggested
to the authors by A.B. Balakrishnan: Let $X$ be a zero mean Gaussian
random variable with variance 1 and let $\Theta$ be a random variable with a
uniform distribution on $[-\pi,\pi)$ which is independent of $X$. Define the
random process $Y(t) = \cos(Xt - \Theta)$. Then analogous to the development of
the autocorrelation function for linear modulation, we have that

$$E[Y(t)] = E[\cos(Xt - \Theta)] = 0$$

$$\begin{aligned}
R_Y(\tau) &= E[\cos(Xt - \Theta)\cos(X(t-\tau) - \Theta)] \\
&= \frac{1}{2} E[\cos(X\tau)] \\
&= \frac{1}{4}\left(M_X(j\tau) + M_X(-j\tau)\right) \\
&= \frac{1}{2} e^{-\tau^2/2},
\end{aligned}$$

so that the power spectral density is

$$S_Y(f) = \sqrt{\frac{\pi}{2}}\, e^{-2\pi^2 f^2}, \tag{5.43}$$

which fails to meet the Paley-Wiener criterion.




5.7 *Time-Averages

Recall the definitions of mean, autocorrelation, and covariance as
expectations of samples of a weakly stationary random process $\{X_n; n \in \mathcal{Z}\}$:

$$\begin{aligned}
m &= E[X_n] \\
R_X(k) &= E[X_n X^*_{n-k}] \\
K_X(k) &= E[(X_n - m)(X_{n-k} - m)^*] \\
&= R_X(k) - |m|^2.
\end{aligned}$$



These are collectively considered as the second-order moments of the
process. The corresponding time-average moments can be described if the
limits are assumed to exist in some suitable sense:

$$\mathcal{M} = \langle X_n \rangle = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} X_n$$

$$\mathcal{R}_X(k) = \langle X_n X^*_{n-k} \rangle = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} X_n X^*_{n-k}$$

$$\mathcal{K}_X(k) = \langle (X_n - \mathcal{M})(X_{n-k} - \mathcal{M})^* \rangle = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} (X_n - \mathcal{M})(X_{n-k} - \mathcal{M})^*$$

Keep in mind that these quantities, if they exist at all, are random variables. 
For example, if we actually view a sample function $\{X_n(\omega); n \in \mathcal{Z}\}$, then
the sample autocorrelation is

$$\mathcal{R}_X(k) = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} X_n(\omega) X^*_{n-k}(\omega),$$

also a function of the sample point $\omega$ and hence a random variable. Of
particular interest is the autocorrelation for 0 lag:

$$\mathcal{P}_X = \mathcal{R}_X(0) = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2,$$

which can be considered as the sample or time average power of the sample
function in the sense that it is the average power dissipated in a unit
resistance if $X_n$ corresponds to a voltage.

Analogous to the expectations, the time-average autocorrelation func- 
tion and the time-average covariance function are related by 

$$\mathcal{K}_X(k) = \mathcal{R}_X(k) - |\mathcal{M}|^2. \tag{5.44}$$

In fact, subject to suitable technical conditions as described in the laws of
large numbers, the time averages should be the same as the expectations,
that is, under suitable conditions a weakly stationary random process $\{X_n\}$
should have the properties that

$$\begin{aligned}
m &= \mathcal{M} \\
R_X(k) &= \mathcal{R}_X(k) \\
K_X(k) &= \mathcal{K}_X(k),
\end{aligned}$$



which provides a suggestion of how the expectations can be estimated in
practice. Typically the actual moments are not known a priori, but the
random process is observed over a finite time $N$ and the results used to
estimate the moments, e.g., the sample mean

$$\frac{1}{N} \sum_{n=0}^{N-1} X_n$$

and the sample autocorrelation function

$$\frac{1}{N} \sum_{n=0}^{N-1} X_n X^*_{n-k}$$

provide intuitive estimates of the actual moments which should converge
to the true moments as $N \to \infty$.
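
The finite-record estimates can be written in a few lines. The sketch below assumes that only the samples $X_0, \ldots, X_{N-1}$ are available, so products whose second index falls outside the record are simply omitted while the normalization stays $1/N$; that boundary convention, like the function names, is an assumption made for illustration rather than a prescription.

```python
def sample_mean(x):
    """(1/N) * sum of the observed samples."""
    return sum(x) / len(x)

def sample_autocorrelation(x, k):
    """(1/N) * sum of x[n] * conj(x[n-k]) over the n for which both samples are observed."""
    N = len(x)
    return sum(x[n] * x[n - k].conjugate() for n in range(N) if 0 <= n - k < N) / N

def sample_covariance(x, k):
    """Sample autocorrelation of the record after removing its sample mean."""
    mean = sample_mean(x)
    return sample_autocorrelation([v - mean for v in x], k)

if __name__ == "__main__":
    record = [1.0, 2.0, 3.0, 4.0]
    print(sample_mean(record), sample_autocorrelation(record, 0), sample_autocorrelation(record, 1))
```

Other normalizations (for example dividing by the number of available products) and windowed versions are also in common use; they differ in bias and variance rather than in spirit.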

There are in fact many ways to estimate second-order moments and there
is a wide literature on the subject. For example, the observed samples may
be weighted or “windowed” so as to diminish the impact of samples in the 
distant past or near the borders of separate blocks of data which are handled 
separately. The literature on estimating correlations and covariances is 
particularly rich in the speech processing area. 

If the process meets the conditions of the law of large numbers, then its
sample average power $\mathcal{P}_X$ will be $R_X(0)$, which is typically some nonzero
positive number. But if the limit $\lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} |X_k|^2$ is not zero, then
observe that necessarily the limit

$$\lim_{N\to\infty} \sum_{k=0}^{N-1} |X_k|^2 = \sum_{k=0}^{\infty} |X_k|^2$$

must blow up since it lacks the normalizing $N$ in the denominator. In
other words, a sample function with nonzero average power will have infi- 
nite energy. The point of this observation is that a sample function from 
a perfectly reasonable random process will not meet the conditions for the 
existence of a Fourier transform, which suggests we might not be able to 
apply the considerable theory of Fourier analysis when considering the be- 
havior of random processes in linear systems. Happily this is not the case, 
but Fourier analysis of random processes will be somewhat different (as well 
as similar) to the Fourier analysis of deterministic finite energy signals and 
of deterministic periodic signals. 

To motivate a possible remedy, first “window” the sample signal
$\{X_n; n \in \mathcal{Z}\}$ to form a new signal $\{X_n^{(N)}; n \in \mathcal{Z}\}$ defined by

$$X_n^{(N)} = \begin{cases} X_n & \text{if } 0 \le n \le N-1 \\ 0 & \text{otherwise.} \end{cases} \tag{5.45}$$

The new random process clearly has finite energy (and is absolutely
summable) so it has a Fourier transform in the usual sense, which can be
defined as

$$X_N(f) = \sum_{n=0}^{\infty} X_n^{(N)} e^{-j2\pi fn} = \sum_{n=0}^{N-1} X_n e^{-j2\pi fn},$$

which is the Fourier transform or spectrum of the truncated sample signal.
Keep in mind that this is a random variable; it depends on the underlying
sample point $\omega$ through the sample waveform selected. From Parseval’s (or
Plancherel’s) theorem, the energy in the truncated signal can be evaluated
from the spectrum as

$$\mathcal{E}_N = \sum_{n=0}^{N-1} |X_n|^2 = \int_{-1/2}^{1/2} |X_N(f)|^2\,df. \tag{5.46}$$


The average power is obtained by normalizing the average energy by the
time duration $N$:

$$\mathcal{P}_N = \frac{1}{N} \sum_{n=0}^{N-1} |X_n|^2 = \frac{1}{N} \int_{-1/2}^{1/2} |X_N(f)|^2\,df. \tag{5.47}$$



Because of this formula $|X_N(f)|^2/N$ can be considered as the power spectral
density of the truncated waveform because, analogous to a probability den- 
sity or a mass density, it is a nonnegative function which when integrated 
gives the power. Unfortunately it gives only the power spectral density for 
a particular truncated sample function when what is really desired is a no- 
tion of power spectral density for the entire random process. An alternative 
definition of power spectral density resolves these two issues by taking the 
expectation to get rid of the randomness, and the limit to look at the entire 
signal, that is, to define the average power spectral density as the limit (if 
it exists) 



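$$\lim_{N\to\infty} \frac{E(|X_N(f)|^2)}{N}.$$

To evaluate this limit consider

$$\begin{aligned}
\lim_{N\to\infty} \frac{E(|X_N(f)|^2)}{N}
&= \lim_{N\to\infty} \frac{1}{N} E\left( \left| \sum_{k=0}^{N-1} X_k e^{-j2\pi fk} \right|^2 \right) \\
&= \lim_{N\to\infty} \frac{1}{N} E\left( \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} X_k e^{-j2\pi fk} X_l^* e^{+j2\pi fl} \right) \\
&= \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} E[X_k X_l^*]\, e^{-j2\pi f(k-l)} \\
&= \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} R_X(k-l)\, e^{-j2\pi f(k-l)} \\
&= \lim_{N\to\infty} \sum_{k=-(N-1)}^{N-1} \left(1 - \frac{|k|}{N}\right) R_X(k)\, e^{-j2\pi fk},
\end{aligned}$$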



where the last term involves reordering terms using Lemma B.l (analogous 
to what was done to prove the law of large numbers for asymptotically 
uncorrelated weakly stationary processes). If the autocorrelation function 
is absolutely summable, i.e., if 

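$$\sum_{k=-\infty}^{\infty} |R_X(k)| < \infty, \tag{5.48}$$

then Lemma B.2 implies that

$$\lim_{N\to\infty} \sum_{k=-(N-1)}^{N-1} \left(1 - \frac{|k|}{N}\right) R_X(k)\, e^{-j2\pi fk} = \sum_{k=-\infty}^{\infty} R_X(k)\, e^{-j2\pi fk} = S_X(f), \tag{5.49}$$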



the power spectral density as earlier defined. 
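
The reordering of the double sum into a single sum over lags also holds for a single sample record before any expectation is taken, which makes it easy to test numerically. The sketch below assumes a real-valued finite record and checks two things: that $|X_N(f)|^2/N$ equals the Fourier sum of the lag products normalized by $N$, and that integrating it over $[-1/2, 1/2]$ returns the average power as in (5.47). The function names and the midpoint integration rule are illustrative assumptions.

```python
import cmath
import math

def truncated_spectrum(x, f):
    """X_N(f) = sum_{n=0}^{N-1} x[n] exp(-j 2 pi f n)."""
    return sum(xn * cmath.exp(-2j * math.pi * f * n) for n, xn in enumerate(x))

def periodogram(x, f):
    """|X_N(f)|^2 / N for the observed record."""
    return abs(truncated_spectrum(x, f)) ** 2 / len(x)

def periodogram_from_lags(x, f):
    """Same quantity as a sum over lags k = -(N-1)..N-1 of (1/N) sum_n x[n] x[n-k]."""
    N = len(x)
    total = 0j
    for k in range(-(N - 1), N):
        lag_product = sum(x[n] * x[n - k] for n in range(N) if 0 <= n - k < N) / N
        total += lag_product * cmath.exp(-2j * math.pi * f * k)
    return total.real

def average_power_from_spectrum(x, steps=256):
    """Midpoint-rule integral of the periodogram over [-1/2, 1/2]."""
    return sum(periodogram(x, -0.5 + (i + 0.5) / steps) for i in range(steps)) / steps

if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0, -0.5]
    print(periodogram(x, 0.2), periodogram_from_lags(x, 0.2))
    print(average_power_from_spectrum(x), sum(v * v for v in x) / len(x))
```

Taking the expectation of each lag product for a weakly stationary process gives $(1 - |k|/N) R_X(k)$, which is the weighting that appears in the limit above.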



5.8 *Differentiating Random Processes

We have said that linear systems can often be described by means other 
than convolution integrals, e.g., difference equations for discrete time and 
differential equations in continuous time. In this section we explore the I/O 
relations for a simple continuous time differentiator in order to demonstrate 
some of the techniques involved for handling such systems. In addition, the 
results developed will provide another interpretation of white noise.

Suppose now that we have a continuous time random process $\{X(t)\}$
and we form a new random process $\{Y(t)\}$ by differentiating; that is,

$$Y(t) = \frac{d}{dt} X(t).$$

In this section we will take $\{X(t)\}$ to be a zero-mean random process for
simplicity. Results for nonzero-mean processes are found by noting that
$X(t)$ can be written as the sum of a zero-mean random process plus the
mean function $m(t)$. That is, we can write $X(t) = X_0(t) + m(t)$ where
$X_0(t) = X(t) - m(t)$. Then, the derivative of $X(t)$ is the derivative of a
zero-mean random process plus the derivative of the mean function. The 
derivative of the mean function is a derivative in the usual sense and hence 
provides no special problems. 

To be strictly correct, there is a problem in interpreting what the deriva- 
tive means when the thing being differentiated is a random process. A 
derivative is defined as a limit, and as we found in chapter 4, there are 
several notions of limits of sequences of random variables. Care is required 
because the limit may exist in one sense but not necessarily in another. In 
particular, two natural definitions for the derivative of a random process 
correspond to convergence with probability one and convergence in mean 
square. As a first possibility we could assume that each sample function 
of $Y(t)$ is obtained by differentiating each sample function of $X(t)$; that is,
we could use ordinary differentiation on the sample functions. This gives
us a definition of the form

$$Y(t,\omega) = \frac{d}{dt} X(t,\omega) = \lim_{\Delta t \to 0} \frac{X(t + \Delta t, \omega) - X(t, \omega)}{\Delta t}.$$

If $P(\{\omega : \text{the limit exists}\}) = 1$, then the definition of differentiation
corresponds to convergence with probability one. Alternatively, we could define
$Y(t)$ as a limit in quadratic mean of the random variables
$(X(t + \Delta t) - X(t))/\Delta t$ as $\Delta t$ goes to zero (which does not require
that the derivative exist with probability one on sample functions). With this
definition we obtain

$$Y(t) = \mathop{\mathrm{l.i.m.}}_{\Delta t \to 0} \frac{X(t + \Delta t) - X(t)}{\Delta t}.$$

Clearly a choice of definition of derivative must be made in order to develop 
a theory for this simple problem and, more generally, for linear systems 
described by differential equations. We will completely avoid the issue here 
by sketching a development with the assumption that all of the derivatives
exist as required. We will blithely ignore careful specification of conditions
under which the formulas make sense. (Mathematicians sometimes refer to
such derivations as formal developments: Techniques are used as if they are
applicable in order to see what happens. This often provides the answer to a
problem, which, once known, can then be proved rigorously to be correct.) 

Although we make no attempt to prove it, the result we will obtain 
can be shown to hold under sufficient regularity conditions on the process. 
In engineering applications these regularity conditions are almost always 
either satisfied, or if they are not satisfied, the answers that we obtain can 
be applied anyway, with care. 

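Formally define a process $\{Y_{\Delta t}(t)\}$ for a fixed $\Delta t$ as the following
difference, which approximates the derivative of $X(t)$:

$$Y_{\Delta t}(t) = \frac{X(t + \Delta t) - X(t)}{\Delta t}.$$

This difference process is perfectly well defined for any fixed $\Delta t > 0$ and
in some sense it should converge to the desired $Y(t)$ as $\Delta t \to 0$. We can
easily find the following correlation:

$$\begin{aligned}
E[Y_{\Delta t}(t) Y_{\Delta s}(s)]
&= E\left[ \frac{X(t + \Delta t) - X(t)}{\Delta t}\, \frac{X(s + \Delta s) - X(s)}{\Delta s} \right] \\
&= \frac{R_X(t + \Delta t, s + \Delta s) - R_X(t + \Delta t, s) - R_X(t, s + \Delta s) + R_X(t, s)}{\Delta t\, \Delta s}.
\end{aligned}$$

If we now (formally) let $\Delta t$ and $\Delta s$ go to zero, then, if the various limits
exist, this formula becomes

$$R_Y(t,s) = \frac{\partial^2}{\partial t\, \partial s} R_X(t,s). \tag{5.50}$$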

As previously remarked, we will not try to specify complete conditions 
under which this sleight of hand can be made rigorous. Suffice it to say 
that if the conditions on the X process are sufficiently strong, the formula 
is valid. Intuitively, since differentiation and expectation are linear opera- 
tions, the formula follows from the assumption that the linear operations 
commute, as they usually do. There are, however, serious issues of existence 
involved in making the proof precise. 
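
As a numerical sanity check of (5.50), the sketch below compares the
difference-quotient correlation with the mixed partial derivative for one
smooth autocorrelation. The choice $R_X(t,s) = e^{-(t-s)^2}$, the evaluation
point, the step size, and all function names are illustrative assumptions
and are not taken from the text.

```python
import math

def r_x(t, s):
    """Assumed autocorrelation R_X(t, s) = exp(-(t - s)^2)."""
    return math.exp(-(t - s) ** 2)

def diff_quotient_corr(t, s, dt, ds):
    """Correlation E[Y_dt(t) Y_ds(s)] of the difference processes."""
    num = r_x(t + dt, s + ds) - r_x(t + dt, s) - r_x(t, s + ds) + r_x(t, s)
    return num / (dt * ds)

def r_y_exact(t, s):
    """Mixed partial d^2/(dt ds) of exp(-(t - s)^2), computed by hand."""
    tau = t - s
    return (2.0 - 4.0 * tau ** 2) * math.exp(-tau ** 2)

t, s = 0.7, 0.2
approx = diff_quotient_corr(t, s, 1e-4, 1e-4)
exact = r_y_exact(t, s)
print(approx, exact)
```

For this smooth choice the two numbers agree to several decimal places,
which is what the formal limit suggests; the check says nothing about
processes for which the derivative fails to exist.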

One obvious regularity condition to apply is that the double derivative 
of (5.50) exists. If it does and the processes are weakly stationary, then 
we can transform (5.50) by using the property of Fourier transforms that
differentiation in the time domain corresponds to multiplication by $f$
(up to a constant factor) in the frequency domain. Then for the double
derivative to exist we obtain the requirement that the spectral density of
$\{X(t)\}$ have finite second moment; i.e., if $S_Y(f) = f^2 S_X(f)$, then

$$\int_{-\infty}^{\infty} S_Y(f) \, df < \infty . \tag{5.51}$$



As a rather paradoxical application of (5.50), suppose that we have a
one-sided continuous time Gaussian random process $\{X(t);\ t \ge 0\}$ that has
zero mean and a covariance function that is the continuous time analog of
example [5.3]; that is, $K_X(t,s) = \sigma^2 \min(t, s)$. (The Kolmogorov
construction guarantees that there is such a random process; that is, that
it is well defined.) This process is known as the continuous time Wiener
process, a process that we will encounter again in the next chapter.
Strictly speaking, the double derivative of this function does not exist
because its first derivative is discontinuous at $t = s$. From engineering
intuition, however, the derivative of such a step discontinuity is an
impulse, suggesting that

$$R_Y(t, s) = \sigma^2 \delta(t - s) ,$$



the covariance function for Gaussian white noise! Because of this formal re- 
lation, Gaussian white noise is sometimes described as the formal derivative 
of a Wiener process. We have to play loose with mathematics to find this 
result, and the sloppiness cannot be removed in a straightforward manner. 
In fact, it is known from the theory of Wiener processes that they have the 
strange attribute of producing with probability one sample waveforms that 
are continuous but nowhere differentiable! Thus we are considering white 
noise as the derivative of a process that is not differentiable. In a sense, 
however, this is a useful intuition that is consistent with the extremely 
pathological behavior of sample waveforms of white noise — an idealized 
concept of a process that cannot really exist anyway. 
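
The impulse intuition can be made concrete without differentiating
anything: the sketch below evaluates the correlation of the difference
process $Y_\Delta(t) = (X(t+\Delta) - X(t))/\Delta$ for the Wiener covariance
$\sigma^2 \min(t,s)$. The value $\sigma^2 = 2$, the time $t = 1$, the step
sizes, and the function names are assumptions made for illustration.

```python
def k_wiener(t, s, sigma2):
    """Wiener process covariance K_X(t, s) = sigma^2 min(t, s)."""
    return sigma2 * min(t, s)

def diff_corr(t, s, delta, sigma2):
    """E[Y_delta(t) Y_delta(s)] with equal increments delta in t and s."""
    num = (k_wiener(t + delta, s + delta, sigma2)
           - k_wiener(t + delta, s, sigma2)
           - k_wiener(t, s + delta, sigma2)
           + k_wiener(t, s, sigma2))
    return num / delta ** 2

sigma2, t = 2.0, 1.0
results = {}
for delta in (0.1, 0.01):
    peak = diff_corr(t, t, delta, sigma2)             # value at s = t
    far = diff_corr(t, t + 2 * delta, delta, sigma2)  # value for |t - s| > delta
    # Midpoint Riemann sum of the correlation over s in [t - 2 delta, t + 2 delta].
    cells = 400
    step = 4 * delta / cells
    area = step * sum(
        diff_corr(t, t - 2 * delta + (i + 0.5) * step, delta, sigma2)
        for i in range(cells)
    )
    results[delta] = (peak, far, area)
    print(delta, peak, far, area)
```

As $\Delta$ shrinks, the correlation becomes a taller and narrower triangle
of height $\sigma^2/\Delta$ whose area stays at $\sigma^2$, which is the
behavior one expects of an approximation to $\sigma^2 \delta(t - s)$.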



5.9 *Linear Estimation and Filtering

In this section we give another application of second-order moments in
linear systems by showing how they arise in one of the basic problems of
communication: estimating the outcomes of one random process based on
observations of another process using a linear filter. The initial results can
be viewed as process variations on the vector results of Section 4.10, but
we develop them independently here in the process and linear filtering
context for completeness. We will obtain the classical orthogonality principle
and the Wiener-Hopf equation and consider solutions for various simple
cases. This section provides additional practice in manipulating
second-order moments of random processes and provides more evidence for their
importance.

We will focus on discrete time for the usual reasons, but the continuous
time analogs are found, also as usual, by replacing the sums by integrals.

Suppose that we are given a record of observations of values of one
random process; e.g., we are told the values of $\{Y_i;\ N < i < M\}$, and
we are asked to form the best estimate of a particular sample, say $X_n$, of
another, related random process $\{X_k;\ k \in \mathcal{T}\}$. We refer to the
collection of indices of observed samples by $\mathcal{K} = (N, M)$. We permit
$N$ and $M$ to take on infinite values. For convenience we assume throughout
this section that both processes have zero means for all time. We place the
strong constraint on the estimate that it must be linear; that is, the
estimate $\hat{X}_n$ of $X_n$ must have the form

$$\hat{X}_n = \sum_{k:\, n-k \in \mathcal{K}} h_k Y_{n-k} = \sum_{k \in \mathcal{K}} h_{n-k} Y_k$$

for some pulse response $h$. We wish to find the “best” possible filter $h$,
perhaps under additional constraints such as causality. One possible notion
of best is to define the error

$$\epsilon_n = X_n - \hat{X}_n$$

and define that filter to be best within some class if it minimizes the mean
squared error $E(\epsilon_n^2)$; that is, a filter satisfying some constraints
will be considered optimum if no other filter yields a smaller expected
squared error. The filter accomplishing this goal is often called a linear
least squared error (LLSE) filter.

Many constraints on the filter or observation times are possible. Typical 
constraints on the filter and on the observations are the following: 

1. We have a non-causal filter that can “see” into the infinite future
and a two-sided infinite observation $\{Y_k;\ k \in \mathcal{Z}\}$. Here we
consider $N = -\infty$ and $M = \infty$. This is clearly not completely
possible, but it may be a reasonable approximation for a system using a
very long observation record to estimate a sample of a related process
in the middle of the record.

2. The filter is causal ($h_k = 0$ for $k < 0$), a constraint that can be
incorporated by assuming that $n \ge M$; that is, that samples occurring
after the one we wish to estimate are not observed. When $n > M$ the
estimator is sometimes called a predictor since it estimates the value
of the desired process at a time later than the last observation. Here
we assume that we observe the entire past of the $Y$ process; that is,
we take $N = -\infty$ and observe $\{Y_k;\ k < M\}$. If, for example, the $X$
process and the $Y$ process are the same and $M = n$, then this case is
called the one-step predictor (based on the semi-infinite past).

3. The filter is causal, and we have a finite record of $T$ seconds; that is,
we observe $\{Y_k;\ M - T < k < M\}$.

As one might suspect, the fewer the constraints, the easier the solution 
but the less practical the resulting filter. We will develop a general charac- 
terization for the optimum filters, but we will provide specific solutions only 
for certain special cases. We formally state the basic result as a theorem 
and then prove it. 

Theorem 5.1 Suppose that we are given a set of observations
$\{Y_k;\ k \in \mathcal{K}\}$ of a zero-mean random process $\{Y_k\}$ and that we
wish to find a linear estimate $\hat{X}_n$ of a sample $X_n$ of a zero-mean
random process $\{X_n\}$ of the form

$$\hat{X}_n = \sum_{k:\, n-k \in \mathcal{K}} h_k Y_{n-k} . \tag{5.52}$$

If the estimation error is defined as

$$\epsilon_n = X_n - \hat{X}_n ,$$

then for a fixed $n$ no linear filter can yield a smaller expected squared
error $E(\epsilon_n^2)$ than a filter $h$ (if it exists) that satisfies the
relation

$$E(\epsilon_n Y_k) = 0 ; \quad \text{all } k \in \mathcal{K} , \tag{5.53}$$

or, equivalently,

$$E(X_n Y_k) = \sum_{i:\, n-i \in \mathcal{K}} h_i E(Y_{n-i} Y_k) ; \quad \text{all } k \in \mathcal{K} . \tag{5.54}$$

If $R_Y(k,j) = E(Y_k Y_j)$ is the autocorrelation function of the $Y$ process
and $R_{X,Y}(k,j) = E(X_k Y_j)$ is the cross-correlation function of the two
processes, then (5.54) can be written as

$$R_{X,Y}(n,k) = \sum_{i:\, n-i \in \mathcal{K}} h_i R_Y(n-i, k) ; \quad \text{all } k \in \mathcal{K} . \tag{5.55}$$

If the processes are jointly weakly stationary in the sense that both are
individually weakly stationary with a cross-correlation function that depends
only on the difference between the arguments, then, with the replacement of
$k$ by $n - k$, the condition becomes

$$R_{X,Y}(k) = \sum_{i:\, n-i \in \mathcal{K}} h_i R_Y(k - i) ; \quad \text{all } k : n - k \in \mathcal{K} . \tag{5.56}$$

Comments. Two random variables $U$ and $V$ are said to be orthogonal
if $E(UV) = 0$. Therefore equation (5.53) is known as the orthogonality
principle because it states that the optimal filter causes the estimation
error to be orthogonal to the observations. Note that (5.53) implies not
only that the estimation error is orthogonal to the observations, but that it
is also orthogonal to all linear combinations of the observations. Relation
(5.56) with $\mathcal{K} = (-\infty, n)$ is known as the Wiener-Hopf equation.
To be useful in practice, we must be able to find a pulse response that solves
one of these equations. We shall later find solutions for some simple cases. A
more general treatment is beyond the intended scope of this book. Our
emphasis here is to demonstrate an example in which determination of an
optimal filter for a reasonably general problem requires the solution of an
equation given in terms of second-order moments.

Proof. Suppose that we have a filter $h$ that satisfies the given conditions.
Let $g$ be any other linear filter with the same input observations and let
$\tilde{X}_n$ be the resulting estimate. We will show that the given conditions
imply that $g$ can yield an expected squared error no better than that of $h$.
Let $\tilde{\epsilon}_n = X_n - \tilde{X}_n$ be the estimation error using $g$
so that

$$E(\tilde{\epsilon}_n^2) = E\Big( \Big( X_n - \sum_{i:\, n-i \in \mathcal{K}} g_i Y_{n-i} \Big)^2 \Big) .$$

Add and subtract the estimate using $h$ satisfying the conditions of the
theorem and expand the square to obtain

$$E(\tilde{\epsilon}_n^2) = E\Big( \Big( X_n - \sum_{i:\, n-i \in \mathcal{K}} h_i Y_{n-i} + \sum_{i:\, n-i \in \mathcal{K}} h_i Y_{n-i} - \sum_{i:\, n-i \in \mathcal{K}} g_i Y_{n-i} \Big)^2 \Big)$$

$$= E\Big( \Big( X_n - \sum_{i:\, n-i \in \mathcal{K}} h_i Y_{n-i} \Big)^2 \Big) + 2 E\Big( \Big( X_n - \sum_{i:\, n-i \in \mathcal{K}} h_i Y_{n-i} \Big) \Big( \sum_{i:\, n-i \in \mathcal{K}} (h_i - g_i) Y_{n-i} \Big) \Big)$$

$$+ E\Big( \Big( \sum_{i:\, n-i \in \mathcal{K}} (h_i - g_i) Y_{n-i} \Big)^2 \Big) .$$

The first term on the right is the expected squared error using the filter
$h$, say $E(\epsilon_n^2)$. The last term on the right is the expectation of
something squared and is hence nonnegative. Thus we have the lower bound

$$E(\tilde{\epsilon}_n^2) \ge E(\epsilon_n^2) + 2 \sum_{i:\, n-i \in \mathcal{K}} (h_i - g_i) \Big[ E(X_n Y_{n-i}) - \sum_{j:\, n-j \in \mathcal{K}} h_j E(Y_{n-j} Y_{n-i}) \Big] ,$$

where we have brought one of the sums out, used different dummy variables
for the two sums, and interchanged some expectations and sums. From
(5.54), however, the bracketed term is zero for each $i$ in the index set being
summed over, and hence the entire sum is zero, proving that

$$E(\tilde{\epsilon}_n^2) \ge E(\epsilon_n^2) ,$$

which completes the proof of the theorem.

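Note that from (5.52) through (5.56) we can write the mean square error
for the optimum linear filter as

$$E(\epsilon_n^2) = E\Big( \epsilon_n \Big( X_n - \sum_{k:\, n-k \in \mathcal{K}} h_k Y_{n-k} \Big) \Big) = E(\epsilon_n X_n) = R_X(n,n) - \sum_{k:\, n-k \in \mathcal{K}} h_k R_{X,Y}(n, n-k)$$

in general and

$$E(\epsilon_n^2) = R_X(0) - \sum_{k:\, n-k \in \mathcal{K}} h_k R_{X,Y}(k)$$

for weakly stationary processes.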

Older proofs of the result just given use the calculus of variations, that 
is, calculus minimization techniques. The method we have used, however, 
is simple and intuitive and shows that a filter satisfying the given equations 
actually yields a global minimum to the mean squared error and not only 
a local minimum as usually obtained by calculus methods. A
popular proof of the basic orthogonality principle is based on Hilbert space
methods and the projection theorem, the generalization of the standard
geometric result that the shortest line from a point to a plane is the
projection of the point on the plane: the line passing through the point which
meets the plane at a right angle (is orthogonal to the plane). The projection
method also proves that the filter of (5.53) yields a global minimum. 
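
The theorem can be checked numerically on a small finite problem. The
sketch below solves (5.56) for a three-tap filter, confirms the
orthogonality condition, and compares the resulting mean squared error with
that of a few perturbed filters. The model ($R_X(k) = \rho^{|k|}$ observed in
uncorrelated white noise of variance `var_v`), the tap set, the parameter
values, and the helper names are all assumptions chosen for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

rho, var_v = 0.8, 0.5   # assumed signal correlation and noise variance
taps = [1, 2, 3]        # estimate X_n from Y_{n-1}, Y_{n-2}, Y_{n-3}

def r_x(k):
    return rho ** abs(k)

def r_xy(k):            # Y = X + V with V uncorrelated with X, so R_{X,Y} = R_X
    return r_x(k)

def r_y(k):
    return r_x(k) + (var_v if k == 0 else 0.0)

# Equation (5.56): R_{X,Y}(k) = sum_i h_i R_Y(k - i) for every observed lag k.
h = solve([[r_y(k - i) for i in taps] for k in taps], [r_xy(k) for k in taps])

def mse(g):
    """E[(X_n - sum_i g_i Y_{n-i})^2] for an arbitrary filter g on the same taps."""
    cross = sum(gi * r_xy(i) for gi, i in zip(g, taps))
    quad = sum(gi * gj * r_y(i - j)
               for gi, i in zip(g, taps) for gj, j in zip(g, taps))
    return r_x(0) - 2.0 * cross + quad

residuals = [r_xy(k) - sum(hi * r_y(k - i) for hi, i in zip(h, taps)) for k in taps]
mse_opt = mse(h)
mse_formula = r_x(0) - sum(hk * r_xy(k) for hk, k in zip(h, taps))
perturbed = [mse([hi + d for hi in h]) for d in (-0.1, 0.05, 0.2)]
print(h, mse_opt, mse_formula, perturbed)
```

The residuals are the quantities $E(\epsilon_n Y_{n-k})$, so their vanishing
is the orthogonality principle, and every perturbed filter gives a larger
error than the filter solving (5.56).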

We consider four examples in which the theorem can be applied to 
construct an estimate. The first two are fairly simple and suffice for a brief 
reading. 

[5.4] Suppose that the processes are jointly weakly stationary, that we are
given the entire two-sided realization of the random process $\{Y_n\}$, and
that there are no restrictions on the linear filter $h$. Equation (5.56)
then becomes

$$R_{X,Y}(k) = \sum_{i \in \mathcal{Z}} h_i R_Y(k - i) ; \quad \text{all } k \in \mathcal{Z} .$$

This equation is a simple convolution and can be solved by standard
Fourier techniques. Take the Fourier transform of both sides and define the
transform of the cross-correlation function $R_{X,Y}$ to be the cross-spectral
density $S_{X,Y}(f)$. We obtain $S_{X,Y}(f) = H(f) S_Y(f)$ or

$$H(f) = \frac{S_{X,Y}(f)}{S_Y(f)} ,$$

which can be inverted to find the optimal pulse response $h$:

$$h_k = \int_{-1/2}^{1/2} \frac{S_{X,Y}(f)}{S_Y(f)} e^{j 2 \pi k f} \, df ,$$

which yields an optimum estimate

$$\hat{X}_n = \sum_{i=-\infty}^{\infty} h_i Y_{n-i} .$$

Thus we have an explicit solution for the optimal linear estimator for this
case in terms of the second-order properties of the given processes. Note,
however, that the resulting filter is not causal in general. Another
important observation is that the filter itself does not depend on the sample
time $n$ at which we wish to estimate the $X$ process; e.g., if we want to
estimate $X_{n+1}$, we apply the same filter to the shifted observations; that
is,

$$\hat{X}_{n+1} = \sum_{i=-\infty}^{\infty} h_i Y_{n+1-i} .$$

Thus in this example not only have we found a means of estimating $X_n$ for
a fixed $n$, but the same filter also works for any $n$. When one filter works
for all estimate sample times by simply shifting the observations, we say
that it is a time-invariant or stationary estimator. As one might guess, such
time invariance is a consequence of the weak stationarity of the processes.

The most important application of example [5.4] is to “infinite smoothing,”
where $Y_n = X_n + V_n$ and $\{V_n\}$ is a noise process that is uncorrelated
with the signal process $\{X_n\}$, i.e., $R_{X,V}(k) = 0$ for all $k$. Then
$R_{X,Y} = R_X$ and $R_Y = R_X + R_V$, so that

$$H(f) = \frac{S_X(f)}{S_X(f) + S_V(f)} . \tag{5.57}$$
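
To see what the smoothing filter (5.57) looks like in the time domain, the
sketch below inverts $H(f)$ numerically and checks that the resulting pulse
response satisfies the two-sided version of (5.56). The signal model
($R_X(k) = \rho^{|k|}$), the white noise level `var_v`, the frequency grid,
and the truncation to lags $|k| \le 40$ are assumptions made for
illustration.

```python
import math

rho, var_v = 0.8, 0.5   # assumed signal correlation and white-noise variance

def s_x(f):
    """Spectral density of a process with R_X(k) = rho**|k|."""
    return (1 - rho ** 2) / (1 - 2 * rho * math.cos(2 * math.pi * f) + rho ** 2)

def h_freq(f):
    """Smoothing filter (5.57) with S_V(f) = var_v."""
    return s_x(f) / (s_x(f) + var_v)

def pulse_response(k, grid=1024):
    """Midpoint-rule inverse transform of H(f) over [-1/2, 1/2).

    H is real and even here, so the complex exponential reduces to a cosine.
    """
    total = 0.0
    for m in range(grid):
        f = -0.5 + (m + 0.5) / grid
        total += h_freq(f) * math.cos(2 * math.pi * k * f)
    return total / grid

span = 40
h = {k: pulse_response(k) for k in range(-span, span + 1)}

def r_x(k):
    return rho ** abs(k)

def r_y(k):
    return r_x(k) + (var_v if k == 0 else 0.0)

# R_{X,Y}(k) = R_X(k) should equal sum_i h_i R_Y(k - i) for every k.
residuals = [r_x(k) - sum(h[i] * r_y(k - i) for i in h) for k in range(4)]
print(h[0], h[1], h[-1], residuals)
```

The pulse response is symmetric about $k = 0$, so it weights future
observations as heavily as past ones, which is the non-causality noted in
example [5.4].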



[5.5] Again assume that the processes are jointly weakly stationary. Assume
that we require a causal linear filter $h$ but that we observe
the infinite past of the observation process. Assume further that the
observation process is white noise; that is, $R_Y(k) = \frac{N_0}{2} \delta_k$.
Then $\mathcal{K} = \{n, n-1, n-2, \ldots\}$, and equation (5.56) becomes the
Wiener-Hopf equation

$$R_{X,Y}(k) = \sum_{i:\, n-i \in \mathcal{K}} h_i R_Y(k - i) = \sum_{i=0}^{\infty} h_i \frac{N_0}{2} \delta_{k-i} ; \quad k \in \mathcal{Z}_+ .$$

This equation easily reduces because of the delta function to

$$h_k = \frac{2}{N_0} R_{X,Y}(k) , \quad k \in \mathcal{Z}_+ .$$

Thus we have for this example the optimal estimator

$$\hat{X}_n = \sum_{k=0}^{\infty} \frac{2}{N_0} R_{X,Y}(k) Y_{n-k} .$$

As with the previous example, the filter does not depend on $n$, and hence
the estimator is time-invariant.
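
A small check of the white-observation solution: suppose, purely as an
assumed model, that $X_n = \sum_m c_m Y_{n-m} + Z_n$ with $Z$ uncorrelated
with the white process $Y$. Then $R_{X,Y}(k) = c_k N_0/2$ for the lags where
$c_k$ is defined, and the formula $h_k = (2/N_0) R_{X,Y}(k)$ should return
the coefficients $c_k$. The coefficient values and the noise level below
are hypothetical.

```python
n0_over_2 = 2.0            # assumed white-noise level: R_Y(k) = (N_0/2) delta_k
c = [1.0, 0.5, 0.25]       # assumed coefficients in X_n = sum_m c[m] Y_{n-m} + Z_n

def r_y(k):
    return n0_over_2 if k == 0 else 0.0

def r_xy(k):
    # E(X_n Y_{n-k}) = c[k] N_0/2 for 0 <= k < len(c); zero for other lags.
    return c[k] * n0_over_2 if 0 <= k < len(c) else 0.0

taps = range(6)
h = [r_xy(k) / n0_over_2 for k in taps]   # h_k = (2/N_0) R_{X,Y}(k)

# Causal Wiener-Hopf equation: R_{X,Y}(k) - sum_i h_i R_Y(k - i) = 0 for k >= 0.
residuals = [r_xy(k) - sum(h[i] * r_y(k - i) for i in taps) for k in taps]
print(h, residuals)
```

The recovered filter reproduces the part of $X_n$ that is a causal linear
function of the observations and ignores the uncorrelated part $Z_n$.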

The case of a white observation process is indeed special, but it suggests 
a general approach to solving the Wiener-Hopf equation, which we sketch 
next. 

[5.6] Assume joint weak stationarity and a causal filter on a semi-infinite
observation sequence as in example [5.5], but do not assume that the
observation process is white. In addition, assume that the observation
process is physically realizable so that a spectral factorization of the
form of (5.40) exists; that is,

$$S_Y(f) = |G(f)|^2$$

for some causal stable filter with transfer function $G(f)$. As previously
discussed, for practical purposes, all random processes have spectral
densities of this form. We also assume that the inverse filter, the
filter with transfer function $1/G(f)$, is causal and stable. Again, this
holds under quite general conditions. Observe in particular that you
can’t run into trouble with $G(f)$ being zero on a frequency interval
of nonzero length because the condition in the spectral factorization
theorem would be violated.

Unlike the earlier examples, this example does not have a trivial solution.
We sketch a solution as a modification to the solution of example
[5.5]. The given observation process may not be white, but suppose that
we pass it through a linear filter $r$ with transfer function
$R(f) = 1/G(f)$ to obtain a new random process, say $\{W_n\}$. Since the
inverse filter $1/G(f)$ is assumed stable, the $W$ process has power spectral
density $S_W(f) = S_Y(f) |R(f)|^2 = 1$ for all $f$; that is, $\{W_n\}$ is
white. One says that the $W$ process is a whitened version of the $Y$ process,
sometimes called the innovations process of $\{Y_n\}$. Intuitively, the $W$
process contains the same information as the $Y$ process, since the $Y$
process can be recovered from it by passing it through the filter $G(f)$
(at least in principle). Thus we can get an estimate of $X_n$ from the $W$
process that is just as good as (and no better than) that obtainable from
the $Y$ process. Furthermore, if we now filter the $W$ process to estimate
$X_n$, then the overall operation of the whitening filter followed by the
estimating linear filter is also a linear filter, producing the estimate from
the original observations. Since the inverse filter is causal, a causal
estimate based on the $W$ process is also a causal estimate based on the $Y$
process.

Because $W$ is white, the estimate of $X_n$ from $\{W_n\}$ is given immediately
by the solution to example [5.5]; that is, the filter $h$ with the $W$ process
as input is given by $h_k = R_{X,W}(k)$ for $k \ge 0$. The cross-correlation of
the $X$ and $W$ processes can be calculated using the standard linear filter
I/O techniques. It turns out that the required cross-correlation is the
inverse Fourier transform of a cross-spectral density given by

$$S_{X,W}(f) = \frac{S_{X,Y}(f)}{G(f)^*} = H(f) , \tag{5.58}$$

where the asterisk denotes the complex conjugate. (See problem 5.21.)
Thus the optimal causal linear estimator given the whitened process is

$$h_k = \begin{cases} \displaystyle \int_{-1/2}^{1/2} \frac{S_{X,Y}(f)}{G(f)^*} e^{j 2 \pi k f} \, df & k \ge 0 , \\ 0 & \text{otherwise} , \end{cases}$$

and the overall optimal linear estimate has the form shown in Figure 5.3.

Although more complicated, we again have a filter that does not depend
on $n$.

This approach to solving the Wiener-Hopf equation is called the
“prewhitening” (or “innovations” or “shaping filter”) approach and it can be
made rigorous under quite general conditions. That is, for all practical
purposes, the optimal filter can be written in this cascade form as a
whitening filter followed by a LLSE filter given the whitened observations as
long as the processes are jointly weakly stationary and the observation
process’s power spectral density admits the spectral factorization assumed
above.
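
The cascade structure can be illustrated on a case where the answer is
known in advance. Assume, for illustration only, that the observations
follow $Y_n = a Y_{n-1} + W_n$ with $\{W_n\}$ white with unit variance, so
that $G(f) = 1/(1 - a e^{-j 2 \pi f})$, and that the quantity to be estimated
is the next observation, $X_n = Y_{n+1}$. Then $R_{X,W}(k) = a^{k+1}$ for
$k \ge 0$, and whitening followed by the filter $h_k = a^{k+1}$ should
reproduce the familiar predictor $a Y_n$. The coefficient, the data record,
and the function names are hypothetical.

```python
a = 0.6                                  # assumed AR(1) coefficient, |a| < 1
y = [0.3, -1.2, 0.8, 2.0, -0.5, 1.1]     # any observation record, zero before time 0

def whiten(y, a):
    """Inverse filter 1/G(f): W_n = Y_n - a Y_{n-1}."""
    return [y[n] - (a * y[n - 1] if n > 0 else 0.0) for n in range(len(y))]

def estimate_from_innovations(w, a):
    """Causal filter on the whitened process with h_k = a**(k + 1), k >= 0."""
    return [sum(a ** (k + 1) * w[n - k] for k in range(n + 1))
            for n in range(len(w))]

w = whiten(y, a)
x_hat = estimate_from_innovations(w, a)  # whitening filter, then estimator
direct = [a * y_n for y_n in y]          # known one-step predictor a * Y_n
print(x_hat)
print(direct)
```

The two lists agree term by term, showing that the two-stage filter is
itself a single causal linear filter acting on the original observations.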

[Figure 5.3: Prewhitening method. Block diagram: a whitening filter followed
by the estimator for $X$ given $W$.]

When the observation interval is finite or when the processes are not
jointly weakly stationary, the spectral factorization approach becomes quite
complicated and cumbersome, and alternative methods, usually in the time
domain, are required. The final example considers such an estimator.

[5.7] Suppose that the random process we wish to estimate satisfies a
difference equation of the form

$$X_{n+1} = \Phi_n X_n + U_n , \quad n \ge 0 , \tag{5.59}$$

where the process $\{U_n\}$ is a zero-mean process that is uncorrelated
with a possibly time-varying second moment $E(U_n^2) = \Gamma_n$ and $X_0$ is
an initial random variable, independent of the $\{U_n\}$. $\{\Phi_n\}$ is a
known sequence of constants. In other words, we know that the random
process is defined by a time-varying linear system driven by noise
that is uncorrelated but not necessarily stationary. Assume that the
observation process has the form

$$Y_n = H_n X_n + V_n , \tag{5.60}$$

a scaled version of the $X$ process plus observation noise, where $H_n$ is
a known sequence of constants. We also assume that the observation
noise has zero mean and is uncorrelated but not necessarily stationary,
say $E(V_n^2) = \Psi_n$. We further assume that the $U$ and $V$ processes
are uncorrelated: $E(U_n V_k) = 0$ for all $n$ and $k$. Intuitively, the
random processes are such that new values are obtained by scaling old
values and adding some perturbations. Additional noise influences
our observations. Suppose that we observe $Y_0, Y_1, \ldots, Y_{n-1}$; what is
the best linear estimate of $X_n$?

In a sense this problem is more restrictive than the Wiener-Hopf
formulation of example [5.6] because we have assumed a particular structure for
the process to be estimated and for the observations. On the other hand,
it is not a special case of the previous model because the time-varying
parameters make it nonstationary and because we restrict the observations to
a finite time window (not including the current observation), often a better
approximation to reality. Because of these differences the spectral
techniques of the standard Wiener-Hopf solution of example [5.6] do not apply
without significant generalization and modification. Hence we consider
another approach, called recursive estimation or Kalman-Bucy filtering, whose
history may be traced to Gauss’s formula for plotting the trajectory of
heavenly bodies. The basic idea is the following: Instead of considering how to
operate on a complete observation record in order to estimate something at
one time, suppose that we already have a good estimate $\hat{X}_n$ for $X_n$
and that we make a single new observation $Y_n$. How can we use this new
information to update our old estimate in a linear fashion to form a new
estimate $\hat{X}_{n+1}$ of $X_{n+1}$? For example, can we find sequences of
numbers $a_n$ and $b_n$ so that

$$\hat{X}_{n+1} = a_n \hat{X}_n + b_n Y_n$$

is a good estimate? One way to view this is that instead of constructing a
filter $h$ described by a convolution that operates on the past to produce an
estimate for each time, we wish a possibly time-varying filter with feedback
that observes its own past outputs or estimates and operates on this and
a new observation to produce a new estimate. This is the basic idea of
recursive filtering, which is applicable to more general models than that
considered here. In particular, the standard developments in the literature
consider vector generalizations of the above difference equations. We sketch
a derivation for the simpler scalar case.

We begin by trying to apply directly the orthogonality principle of (5.53)
through (5.55). If we fix a time $n$ and try to estimate $X_n$ by a linear
filter as

$$\hat{X}_n = \sum_{i=1}^{n} h_i^{(n-1)} Y_{n-i} , \tag{5.61}$$

then the LLSE filter is described by the time-dependent pulse response, say
$h^{(n-1)}$, which, from (5.55), solves the equations

$$R_{X,Y}(n, l) = \sum_{i=1}^{n} h_i^{(n-1)} R_Y(n - i, l) ; \quad l = 0, 1, \ldots, n-1 , \tag{5.62}$$

where the superscript reflects the fact that the estimate is based on
observations through time $n - 1$ and the fact that for this very nonstationary
problem, the filter will likely depend very much on $n$. To demonstrate this,
consider the estimate for $X_{n+1}$. In this case we will have a filter of the
form

$$\hat{X}_{n+1} = \sum_{i=1}^{n+1} h_i^{(n)} Y_{n+1-i} , \tag{5.63}$$

where the LLSE filter satisfies

$$R_{X,Y}(n+1, l) = \sum_{i=1}^{n+1} h_i^{(n)} R_Y(n + 1 - i, l) ; \quad l = 0, 1, \ldots, n . \tag{5.64}$$

Note that (5.64) is different from (5.62), and hence the pulse responses
satisfying the respective equations will also differ. In principle these
equations can be solved to obtain the desired filters. Since they will in
general depend on $n$, however, we are faced with the alarming possibility of
having to apply for each time $n$ a completely different filter $h^{(n)}$ to
the entire record of observations $Y_0, \ldots, Y_n$ up to the current time,
clearly an impractical system design. We shall see, however, that a more
efficient means of recursively computing the estimate can be found. It will
still be based on linear operations, but now they will be time-varying.

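We begin by comparing (5.62) and (5.64) more carefully to find a relation
between the two filters $h^{(n)}$ and $h^{(n-1)}$. If we consider $l < n$, then
(5.59) implies that

$$R_{X,Y}(n+1, l) = E(X_{n+1} Y_l) = E((\Phi_n X_n + U_n) Y_l)$$

$$= \Phi_n E(X_n Y_l) + E(U_n Y_l) = \Phi_n E(X_n Y_l) = \Phi_n R_{X,Y}(n, l) .$$

Re-indexing the sum of (5.64) using this relation and restricting ourselves
to $l < n$ then yields

$$\Phi_n R_{X,Y}(n, l) = \sum_{i=0}^{n} h_{i+1}^{(n)} R_Y(n - i, l)$$

$$= h_1^{(n)} R_Y(n, l) + \sum_{i=1}^{n} h_{i+1}^{(n)} R_Y(n - i, l) ; \quad l = 0, 1, \ldots, n-1 .$$

But for $l < n$ we also have that

$$R_{X,Y}(n, l) = E(X_n Y_l) = E\left( \frac{Y_n - V_n}{H_n} Y_l \right) = \frac{1}{H_n} E(Y_n Y_l) = \frac{1}{H_n} R_Y(n, l)$$

or

$$R_Y(n, l) = H_n R_{X,Y}(n, l) .$$

Substituting this result, we have with some algebra that

$$R_{X,Y}(n, l) = \sum_{i=1}^{n} \frac{h_{i+1}^{(n)}}{\Phi_n - h_1^{(n)} H_n} R_Y(n - i, l) ; \quad l = 0, 1, \ldots, n-1 ,$$

which is the same as (5.62) if one identifies

$$h_i^{(n-1)} = \frac{h_{i+1}^{(n)}}{\Phi_n - h_1^{(n)} H_n} ; \quad i = 1, \ldots, n .$$

From (5.61) the estimate for $X_n$ is

$$\hat{X}_n = \sum_{i=1}^{n} h_i^{(n-1)} Y_{n-i} = \frac{1}{\Phi_n - h_1^{(n)} H_n} \sum_{i=1}^{n} h_{i+1}^{(n)} Y_{n-i} .$$

Comparing this with (5.63) yields

$$\hat{X}_{n+1} = h_1^{(n)} Y_n + (\Phi_n - h_1^{(n)} H_n) \hat{X}_n ,$$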

which has the desired form. It remains, however, to find a means of
computing the numbers $h_1^{(n)}$. Since this really depends on only one
argument $n$, we now change notation for brevity and henceforth denote this
term by $\kappa_n$; that is,

$$\kappa_n = h_1^{(n)} .$$

To describe the estimator completely we need to find a means of computing
$\kappa_n$ and an initial estimate. The initial estimate does not depend on any
observations. The LLSE estimate of a random variable without observations
is the mean of the random variable (see, e.g., problem 4.23). Since by
assumption the processes all have zero mean, $\hat{X}_0 = 0$ is the initial
estimate.

Before computing $\kappa_n$ we make several remarks on the estimator and its
properties. First, we can rewrite the estimator as

$$\hat{X}_{n+1} = \kappa_n (Y_n - H_n \hat{X}_n) + \Phi_n \hat{X}_n .$$

It is easily seen from the orthogonality principle that if $\hat{X}_n$ is a
LLSE estimate of $X_n$ given $Y_0, Y_1, \ldots, Y_{n-1}$, then
$\hat{Y}_n = H_n \hat{X}_n$ is a LLSE estimate of $Y_n$ given the same
observations, so that $\nu_n = Y_n - H_n \hat{X}_n = Y_n - \hat{Y}_n$ can
be interpreted as the “new” information in the observation in the sense
that our best prediction of $Y_n$ based on previously known samples has been
removed from $Y_n$. We can now write

$$\hat{X}_{n+1} = \kappa_n \nu_n + \Phi_n \hat{X}_n .$$

This can be interpreted as saying that the new estimate is formed from
the old estimate by using the same transformation $\Phi_n$ used on the actual
samples and then by adding a term depending only on the new information.

It also follows from the orthogonality principle that the sequence $\nu_n$ is
uncorrelated: Since $H_n \hat{X}_n$ is the LLSE estimate for $Y_n$ based on
$Y_0, \ldots, Y_{n-1}$, the error $\nu_n$ must be orthogonal to past $Y_l$ from
the orthogonality principle. Hence $\nu_n$ must also be orthogonal to linear
combinations of past $Y_l$ and hence also to past $\nu_l$. It is
straightforward to show that $E\nu_n = 0$ for all $n$ and hence orthogonality
of the sequence implies that it is also uncorrelated (problem 5.24). Because
of the various properties, the sequence $\{\nu_n\}$ is called the innovations
sequence of the observations process. Note the analogy with example [5.6],
where the observations were first whitened to form innovations and then the
estimate was formed based on the whitened version.

Observe next that the innovations and the estimation error are simply
related by the formula

$$\nu_n = Y_n - H_n \hat{X}_n = H_n X_n + V_n - H_n \hat{X}_n = H_n (X_n - \hat{X}_n) + V_n$$

or

$$\nu_n = H_n \epsilon_n + V_n ,$$

a useful formula for deriving some of the properties of the filter. For
example, we can use this formula to find a recursion for the estimate error:

$$\epsilon_0 = X_0$$

$$\epsilon_{n+1} = X_{n+1} - \hat{X}_{n+1} = (\Phi_n - \kappa_n H_n) \epsilon_n + U_n - \kappa_n V_n ; \quad n = 0, 1, \ldots . \tag{5.65}$$

This formula implies that

$$E(\epsilon_n) = 0 ; \quad n = 0, 1, \ldots ,$$

and hence the estimate is unbiased (i.e., an estimate having an error which
has zero mean is defined to be unbiased). It also provides a recursion for
finding the expected squared estimation error:

$$E(\epsilon_{n+1}^2) = (\Phi_n - \kappa_n H_n)^2 E(\epsilon_n^2) + E(U_n^2) + \kappa_n^2 E(V_n^2) ,$$

where we have made use of the assumptions of the problem statement, viz.,
the uncorrelation of the $U_n$ and $V_n$ sequences with each other, with
$Y_0, \ldots, Y_{n-1}$, $X_0, \ldots, X_n$, and hence also with $\epsilon_n$.
Rearranging terms for later use, we have that

$$E(\epsilon_{n+1}^2) = \Phi_n^2 E(\epsilon_n^2) - 2 \kappa_n H_n \Phi_n E(\epsilon_n^2) + \kappa_n^2 (H_n^2 E(\epsilon_n^2) + \Psi_n) + \Gamma_n , \tag{5.66}$$

where $\Gamma_n = E(U_n^2)$ and $\Psi_n = E(V_n^2)$.

Since we know that $E(\epsilon_0^2) = E(X_0^2)$, if we knew the $\kappa_n$ we
could recursively evaluate the expected squared errors from the formula and
the given problem parameters. We now complete the system design by developing
a formula for the $\kappa_n$. This is most easily done by using the
orthogonality relation and (5.65):

$$0 = E(\epsilon_{n+1} Y_n) = (\Phi_n - \kappa_n H_n) E(\epsilon_n Y_n) + E(U_n Y_n) - \kappa_n E(V_n Y_n) .$$

Consider the terms on the right. Proceeding from left to right, the first
term involves

$$E(\epsilon_n Y_n) = E(\epsilon_n (H_n X_n + V_n)) = H_n E(\epsilon_n (\epsilon_n + \hat{X}_n)) = H_n E(\epsilon_n^2) ,$$

where we have used the fact that $\epsilon_n$ is orthogonal to $V_n$, to
$Y_0, \ldots, Y_{n-1}$ and hence to $\hat{X}_n$. The second term is zero by the
assumptions of the problem. The third term requires the evaluation

$$E(V_n Y_n) = E(V_n (H_n X_n + V_n)) = E(V_n^2) .$$

Thus we have that

$$0 = (\Phi_n - \kappa_n H_n) H_n E(\epsilon_n^2) - \kappa_n E(V_n^2)$$

or

$$\kappa_n = \frac{\Phi_n H_n E(\epsilon_n^2)}{E(V_n^2) + H_n^2 E(\epsilon_n^2)} .$$

Thus for each $n$ we can solve the recursion for $E(\epsilon_n^2)$ and for the
required $\kappa_n$ to form the next estimate.

We can now combine all of the foregoing mess to produce the final
answer. A recursive estimator for the given model is

$$\hat{X}_0 = 0 ; \quad E(\epsilon_0^2) = E(X_0^2) , \tag{5.67}$$

and for $n = 0, 1, 2, \ldots$,

$$\hat{X}_{n+1} = \kappa_n (Y_n - H_n \hat{X}_n) + \Phi_n \hat{X}_n , \tag{5.68}$$

where

$$\kappa_n = \frac{\Phi_n H_n E(\epsilon_n^2)}{\Psi_n + H_n^2 E(\epsilon_n^2)} \tag{5.69}$$

and where, from (5.66) and (5.69),

$$E(\epsilon_{n+1}^2) = \Phi_n^2 E(\epsilon_n^2) + \Gamma_n - \frac{(\Phi_n H_n E(\epsilon_n^2))^2}{\Psi_n + H_n^2 E(\epsilon_n^2)} . \tag{5.70}$$


Although these equations seem messy, they can be implemented nu- 
merically or in hardware in a straightforward manner. Variations of their 
matrix generalizations are also well suited to fast implementation. Such al- 
gorithms in greater generality are a prime focus of the areas of estimation, 
detection, signal identification, and signal processing. 
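
To indicate how little code the scalar recursion requires, the following
sketch implements (5.67) through (5.70) directly. The function name, the
argument layout (one list per parameter sequence), the constant parameter
values, and the short observation record are assumptions chosen for
illustration; they are not part of the development above.

```python
def kalman_predictor(y, phi, h, gamma, psi, p0):
    """One-step recursive LLSE predictor for the scalar model (5.59)-(5.60).

    y      observations Y_0, ..., Y_{N-1}
    phi    state coefficients Phi_n
    h      observation gains H_n
    gamma  driving-noise second moments Gamma_n = E(U_n^2)
    psi    observation-noise second moments Psi_n = E(V_n^2)
    p0     initial error E(eps_0^2) = E(X_0^2)
    Returns the estimates, the expected squared errors, and the gains kappa_n.
    """
    x_hat, p = 0.0, p0                       # initialization (5.67)
    estimates, errors, gains = [x_hat], [p], []
    for n, y_n in enumerate(y):
        denom = psi[n] + h[n] ** 2 * p
        kappa = phi[n] * h[n] * p / denom                                # (5.69)
        x_hat = kappa * (y_n - h[n] * x_hat) + phi[n] * x_hat            # (5.68)
        p = phi[n] ** 2 * p + gamma[n] - (phi[n] * h[n] * p) ** 2 / denom  # (5.70)
        gains.append(kappa)
        estimates.append(x_hat)
        errors.append(p)
    return estimates, errors, gains

# Hypothetical time-invariant parameters and a short observation record.
y = [1.0, 0.4, -0.3, 0.9, 1.5]
N = len(y)
estimates, errors, gains = kalman_predictor(
    y, [0.9] * N, [1.0] * N, [1.0] * N, [0.5] * N, p0=1.0)
print(gains)
print(errors)
print(estimates)
```

Note that the gains and the expected squared errors do not depend on the
observed data, so in an application they could be computed in advance and
only the one-line update (5.68) would need to run as observations arrive.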



5.10 Problems 

1. Suppose that $\{X_n\}$ is an iid Gaussian process with mean $m$ and
variance $\sigma^2$. Let $h$ be the pulse response $h_0 = 1$, $h_1 = r$, and
$h_k = 0$ for all other $k$. Let $\{W_n\}$ be the output process when the $X$
process is put into the filter described by $h$; that is,

$$W_n = X_n + r X_{n-1} .$$

Assuming that the processes are two-sided (that is, that they are
defined for $n \in \mathcal{Z}$), find $EW_n$ and $R_W(k,j)$. Is $\{W_n\}$
strictly stationary? Next assume that the processes are one-sided; that is,
defined for $n \in \mathcal{Z}_+$. Find $EW_n$ and $R_W(k,j)$. For the
one-sided case, evaluate the limits of $EW_n$ and $R_W(n, n+k)$ as
$n \to \infty$.

2. We define the following two-sided random processes. Let $\{X_n\}$ be an
iid random process with marginal pdf $f_X(x) = e^{-x}$, $x \ge 0$. Let
$\{V_n\}$ be another iid random process, independent of the $X$ process, having
marginal pdf $f_V(v) = 2e^{-2v}$, $v \ge 0$. Define a random process $\{U_n\}$
by the difference equation

$$U_n = X_n + X_{n-1} + V_n .$$

The process $U_n$ can be thought of as the result of passing $X_n$ through
a first order moving average filter and then adding noise. Find $EU_0$
and $R_U(k)$.

3. Let $\{X(t)\}$ be a stationary continuous time random process with zero
mean and autocorrelation function $R_X(\tau)$. The process $X(t)$ is put
into a linear time-invariant stable filter with impulse response $h(t)$ to
form a random process $Y(t)$. A random process $U(t)$ is then defined
as $U(t) = Y(t) X(t - T)$, where $T$ is a fixed delay. Find $EU(t)$ in
terms of $R_X$, $h$, and $T$. Simplify your answer for the case where
$S_X(f) = N_0/2$, all $f$.



4. Find the output power spectral densities in problems 5.1 and 5.2. 



A discrete-time random process $\{X_n;\ n \in \mathcal{Z}\}$ is iid and
Gaussian, with mean 0 and variance 1. It is the input process for a linear
time invariant (LTI) causal filter with Kronecker delta response $h$ defined
by

$$h_k = \begin{cases} \frac{1}{K} & k = 0, 1, \ldots, K-1 \\ 0 & \text{otherwise} \end{cases}$$

so that the output process $\{Y_n\}$ is defined by

$$Y_n = \frac{1}{K} \sum_{k=0}^{K-1} X_{n-k} .$$

This filter (an FIR filter) is often referred to as a comb filter.
A third process $\{W_n\}$ is defined by

$$W_n = Y_n - Y_{n-1} .$$

(a) What are the mean and the power spectral density of the process
$\{Y_n\}$?

(b) Find the characteristic function $M_{Y_n}(ju)$ and the marginal pdf
$f_{Y_n}(y)$.

(c) Find the Kronecker delta response $g$ of an LTI filter for which

$$W_n = \sum_k g_k X_{n-k} .$$

(d) Find the covariance function of $\{W_n\}$.

(e) Do $n^{-1} \sum_{k=0}^{n-1} Y_k$ and $n^{-1} \sum_{k=0}^{n-1} W_k$
converge in probability as $n \to \infty$? If so, to what?

5. Let $\{X_n\}$ be a random process, where $X_i$ is independent of $X_j$ for
$i \ne j$. Each random variable $X_n$ in the process is uniform on the region
$[-1/2n, +1/2n]$. That is, $f_{X_n}(x_n) = n$ when $x_n \in [-1/2n, +1/2n]$,
and 0 otherwise.

Define $\{Y_n\}$ by

$$Y_n = \begin{cases} 0 & n = 0, 1 \\ n X_n - Y_{n-2} & n = 2, 3, 4, \ldots \end{cases}$$

(a) What is the expected value of $X_n$?

(b) What is the variance of $X_n$?

(c) What is the covariance function, $K_X(i,j)$?

(d) Let $S_n = n^{-1} \sum_j X_j$. What is the expected value of $S_n$?

(e) Does $\{X_n\}$ have a WLLN, i.e., does the sample mean
$n^{-1} \sum_{k=0}^{n-1} X_k$ converge in probability to the mean $E[X_n]$?
If so, to what value does the sample mean converge? If it has no WLLN,
explain why not. Make sure to justify your answer based on the definitions
of WLLN and convergence in probability.

(f) Find $EY_n$.

(g) Find $R_Y(i,j)$.

(h) Find the cdf of $Y_4 + Y_6$.



6. Let $\{X(t); t \in \Re\}$ be a stationary continuous time Gaussian random
process with zero mean and power spectral density function

$$S_X(f) = \begin{cases} \gamma & 0 \le |f| \le W \\ 0 & W < |f| < \infty . \end{cases}$$

Let $\{Z(t); t \in \Re\}$ be a stationary continuous time Gaussian random
process with zero mean and power spectral density function

$$S_Z(f) = \begin{cases} N_0 & 0 \le |f| \le B \\ 0 & B < |f| < \infty , \end{cases}$$

where we assume that $B \gg W$, $\gamma > N_0$, and that the two processes
are mutually independent. We consider $X(t)$ to be the "signal" and
$Z(t)$ to be the "noise." The receiver observes the process $\{Y(t)\}$,
where

$$Y(t) = X(t) + Z(t) .$$



(a) Find and sketch the power spectral density of $\{Y(t)\}$.

(b) Find the conditional pdf $f_{Y(t)|X(t)}(y|x)$, the marginal pdf $f_{Y(t)}(y)$,
and the conditional pdf $f_{X(t)|Y(t)}(x|y)$.

(c) Find the minimum mean squared estimate $\hat{X}(t)$ given
the single observation $Y(t)$ and compute the resulting mean
squared error

$$\epsilon^2 = E[(X(t) - \hat{X}(t))^2] .$$

(d) Suppose that you are allowed to use the entire observed signal
$\{Y(t)\}$ to estimate $X(t)$ at a specific time and you can do this
by linearly filtering the observed process. Suppose in particular
that you pass the observed process $\{Y(t)\}$ through a linear filter
with a transfer function

$$H(f) = \begin{cases} 1 & 0 \le |f| \le W \\ 0 & W < |f| < \infty \end{cases}$$

with output $\hat{X}(t)$, an estimate of $X(t)$. (This filter is not causal,
but all the results we derived for second order input/output
relations hold for noncausal filters as well and can be used here.)
Find the resulting mean squared error $E[(X(t) - \hat{X}(t))^2]$.

Which scheme yields smaller average mean squared error?

Hint: Convince yourself that linearity implies that $\hat{X}(t)$ can be
expressed as $X(t)$ plus the output of the filter $H$ when the input
is $Z(t)$.

7. Let $\{X_n\}$ be an iid Gaussian random process with zero mean and
variance $R_X(0) = \sigma^2$. Let $\{U_n\}$ be an iid binary random process,
independent of the $X$ process, with $\Pr(U_n = 1) = \Pr(U_n = -1) = 1/2$.
(All processes are assumed to be two-sided in this problem.) Define
the random processes $Y_n = U_n + X_n$, , and $W_n = U_0 + X_n$,

all n. Find the mean, covariance, and power spectral density of each 
of these processes. Find the cross-covariance functions between the 
processes. 

8. Let $\{U_n\}$, $\{X_n\}$, and $\{Y_n\}$ be the same as in problem 5.7. The process
$\{Y_n\}$ can be viewed as a binary signal corrupted by additive Gaussian
noise. One possible method of trying to remove the noise at a receiver
is to quantize the received sample to form an estimate $\hat{U}_n = q(Y_n)$ of the
original binary sample, where

$$q(r) = \begin{cases} +1 & \text{if } r > 0 \\ -1 & \text{if } r < 0 . \end{cases}$$

Write an integral expression for the error probability $P_e = \Pr(\hat{U}_n \ne U_n)$.
Find the mean, covariance, and power spectral density of the $\hat{U}_n$
process. Are the processes $\{U_n\}$ and $\{\hat{U}_n\}$ equivalent, that is, do
they have the same process distributions? Define an error process $\epsilon_n$
by $\epsilon_n = 0$ if $\hat{U}_n = U_n$ and $\epsilon_n = 1$ if $\hat{U}_n \ne U_n$. Find the marginal pmf,
mean, covariance, and power spectral density of the error process.

9. Cascade filters. Let $\{g_k\}$ and $\{a_k\}$ be the pulse responses of two
discrete time causal linear filters ($g_k = a_k = 0$ for $k < 0$) and let $G(f)$
and $A(f)$ be the corresponding transfer functions, e.g.,

$$G(f) = \sum_{k=0}^{\infty} g_k e^{-j2\pi k f} .$$

Assume that $g_0 = a_0 = 1$. Let $\{Z_n\}$ be a weakly stationary uncorrelated
random process with variance $\sigma^2$ and zero mean. Consider
the cascade of two filters formed by first putting an input $Z_n$ into the
filter $g$ to form the process $X_n$, which is in turn put into the filter $a$
to form the output process $Y_n$.

(a) Let $\{d_k\}$ denote the pulse response of the overall cascade filter,
that is,

$$Y_n = \sum_{k=0}^{\infty} d_k Z_{n-k} .$$

Find an expression for $d_k$ in terms of $\{g_k\}$ and $\{a_k\}$. (As a check
on your answer you should have $d_0 = g_0 a_0 = 1$.)

(b) Let $D(f)$ be the transfer function of the cascade filter. Find
$D(f)$ in terms of $G(f)$ and $A(f)$.

(c) Find the power spectral density $S_Y(f)$ in terms of $\sigma^2$, $G$, and $A$.

(d) Prove that

$$E(Y_n^2) = \int_{-1/2}^{1/2} S_Y(f)\, df \ge \sigma^2 .$$

Hint: Show that if $d_0 = 1$ (from part (a)), then

$$\int_{-1/2}^{1/2} |D(f)|^2\, df = 1 + \int_{-1/2}^{1/2} |1 - D(f)|^2\, df \ge 1 .$$



10. One-step prediction. This problem develops a basic result of esti- 
mation theory. No prior knowledge of estimation theory is required. 
Results from problem 5.9 may be quoted without proof (even if you 







did not complete it). Let $\{X_n\}$ be as in problem 5.9; that is, $\{X_n\}$ is a
discrete time zero-mean random process with power spectral density
$\sigma^2 |G(f)|^2$, where $G(f)$ is the transfer function of a causal filter with
pulse response $\{g_k\}$ with $g_0 = 1$. Form the process $\{\hat{X}_n\}$ by putting
$X_n$ into a causal linear time-invariant filter with pulse response $h_k$:

$$\hat{X}_n = \sum_{k=1}^{\infty} h_k X_{n-k} .$$

Suppose that the linear filter tries to estimate the value of $X_n$ based
on the values of $X_i$ for all $i < n$ by choosing the pulse response $\{h_k\}$
optimally. That is, the filter estimates the next sample based on the
present value and the entire past. Such a filter is called a one-step
predictor. Define the error process $\{\epsilon_n\}$ by

$$\epsilon_n = X_n - \hat{X}_n .$$



(a) Find expressions for the power spectral density $S_\epsilon(f)$ in terms
of $S_X(f)$ and $H(f)$. Use this result to evaluate $E(\epsilon_n^2)$.

(b) Evaluate $S_\epsilon(f)$ and $E(\epsilon_n^2)$ for the case where $1 - H(f) = 1/G(f)$.

(c) Use part (d) of problem 5.9 to show that the prediction filter
$H(f)$ of (b) in this problem yields the smallest possible value
of $E(\epsilon_n^2)$ for any prediction filter. You have just developed the
optimal one-step prediction filter for the case of a process that
can be modeled as a weakly stationary uncorrelated sequence
passed through a linear filter. As discussed in the text, most
discrete time random processes can be modeled in such a fashion,
at least through second-order properties.

(d) Spectral factorization. Suppose that $\{X_n\}$ has a power spectral
density $S_X(f)$ that satisfies

$$\int_{-1/2}^{1/2} \ln S_X(f)\, df < \infty .$$

Expand $\ln S_X(f)$ in a Fourier series and write the expression for
$\exp(\ln S_X(f))$ in terms of the series to find $G(f)$. Find the pulse
response of the optimum prediction filter in terms of your result.
Find the mean square error. (Hint: You will need to know what
evenness of $S_X(f)$ implies for the coefficients in the requested
series and what the Taylor series of an exponential is.)



11. Binary filters. All of the linear filters considered so far were linear in
the sense of real arithmetic. It is sometimes useful to consider filters
that are linear in other algebraic systems, e.g., in binary or modulo 2
arithmetic as defined in (3.65-3.66). Such systems are more appropriate,
for example, when considering communications systems involving
only binary arithmetic, such as binary codes for noise immunity on
digital communication links. A binary first-order autoregressive filter
with input process $\{X_n\}$ and output process $\{Y_n\}$ is defined by the
difference equation

$$Y_n = Y_{n-1} \oplus X_n , \quad \text{all } n .$$

Assume that $\{X_n\}$ is a Bernoulli process with parameter $p$. In
this case the process $\{Y_n\}$ is called a binary first-order autoregressive
source.

(a) Show that for nonnegative integers $k$, the autocorrelation
function of the process $\{Y_n\}$ satisfies

$$R_Y(k) = E(Y_j Y_{j+k}) = \frac{1}{2} \Pr\left( \sum_{i=1}^{k} X_i = \text{an even number} \right) .$$

(b) Use the result of (a) to evaluate $R_Y$ and $K_Y$. Hint: This is most
easily done using a trick. Define the random variable

$$W_k = \sum_{i=1}^{k} X_i .$$

$W_k$ is a binomial random variable. Use this fact and the binomial
theorem to show that

$$\Pr(W_k \text{ is odd}) - \Pr(W_k \text{ is even}) = -(1 - 2p)^k .$$

Alternatively, find a linear recursion relation for $p_k = \Pr(W_k$ is
odd) using conditional probability (i.e., find a formula giving $p_k$
in terms of $p_{k-1}$) and then solve for $p_k$.

(c) Find the power spectral density of the process $\{Y_n\}$.

12. Let $\{X_n\}$ be a Bernoulli random process with parameter $p$ and let $\oplus$
denote mod 2 addition as defined in problem 5.11. Define the first-order
binary moving average process $\{W_n\}$ by the difference equation

$$W_n = X_n \oplus X_{n-1} .$$

This is a mod 2 convolution and an example of what is called a
convolutional code in communication and information theory. Find $p_{W_n}(w)$
and $R_W(k,j)$. Find the power spectral density of the process $\{W_n\}$.
13. Let $\{X(t)\}$ be a continuous time zero-mean Gaussian random process
with spectral density $S_X(f) = N_0/2$, all $f$. Let $H(f)$ and $G(f)$ be
the transfer functions of two linear time-invariant filters with impulse
responses $h(t)$ and $g(t)$, respectively. The process $\{X(t)\}$ is passed
through the filter $h(t)$ to obtain a process $\{Y(t)\}$ and is also passed
through the filter $g(t)$ to obtain a process $\{V(t)\}$; that is,

$$Y(t) = \int_0^{\infty} h(\tau) X(t - \tau)\, d\tau ,$$

$$V(t) = \int_0^{\infty} g(\tau) X(t - \tau)\, d\tau .$$

(a) Find the cross-correlation function $R_{Y,V}(t,s) = E(Y_t V_s)$.

(b) Under what assumptions on $H$ and $G$ are $Y_t$ and $V_t$ independent
random variables?

14. Let $\{X(t)\}$ and $\{Y(t)\}$ be two continuous time zero-mean stationary
Gaussian processes with a common autocorrelation function $R(\tau)$ and
common power spectral density $S(f)$. Assume that $X(t)$ and $Y(s)$
are independent for all $t, s$. Assume also that $E[X(t)Y(s)] = 0$ all
$t, s$ and that $\sigma^2 = R(0)$. For a fixed frequency $f_0$, define the random
process

$$W(t) = X(t) \cos(2\pi f_0 t) + Y(t) \sin(2\pi f_0 t) .$$

Find the mean $E(W(t))$ and autocorrelation $R_W(t,s)$. Is $\{W(t)\}$
weakly stationary?

15. Say that we are given an iid binary random process $\{X_n\}$ with
alphabet $\pm 1$, each having probability 1/2. We form a continuous time
random process $\{X(t)\}$ by assigning

$$X(t) = X_n ; \quad t \in [(n-1)T, nT) ,$$

for a fixed time $T$. This process can also be described as follows: Let
$p(t)$ be a pulse that is 1 for $t \in [0, T)$ and 0 elsewhere. Define

$$X(t) = \sum_k X_k\, p(t - kT) .$$

This is an example of pulse amplitude modulation (PAM). If the
process $X(t)$ is then used to phase-modulate a carrier, the resulting
process is called a phase-shift-keyed (PSK) modulation of the carrier by the
process $\{X(t)\}$. PSK is a popular technique for digital
communications. Define the PSK process

$$U(t) = a_0 \cos(2\pi f_0 t + \delta X(t)) .$$

Observe that neither of these processes is stationary, but we can force
them to be at least weakly stationary by the trick of inserting uniform
random variables in appropriate places. Let $Z$ be a random
variable, uniformly distributed on $[0, T]$ and independent of the original
iid process. Define the random process

$$Y(t) = X(t + Z) .$$

Let $\Theta$ be a random variable uniformly distributed on $[0, 1/f_0]$ and
independent of $Z$ and of the original iid random process. Define the
process

$$V(t) = U(t + \Theta) .$$

Find the mean and autocorrelation functions of the processes $Y(t)$
and $V(t)$.

16. Let $\{X(t)\}$ be a Gaussian random process with zero mean and
autocorrelation function




Find the power spectral density of the process. Let $Y(t)$ be the process
formed by DSB-SC modulation of $X(t)$. Letting $\Theta$ be uniformly
distributed in equation (5.37), sketch the power spectral density of the
modulated process.

17. A continuous time two-sided weakly stationary Gaussian random process
$\{S(t)\}$ with zero mean and power spectral density $S_S(f)$ is put
into a noisy communication channel. First, white Gaussian noise
$\{W(t)\}$ with power spectral density $N_0/2$ is added, where the two
random processes are assumed to be independent of one another, and
then the sum $S(t) + W(t)$ is passed through a linear filter with impulse
response $h(t)$ and transfer function $H(f)$ to form a received process
$\{Y(t)\}$. Find an expression for the power spectral density $S_Y(f)$.
Find an expression for the expected square error $E[(S(t) - Y(t))^2]$
and the so-called signal-to-noise ratio (SNR)

$$\frac{E[S(t)^2]}{E[(S(t) - Y(t))^2]} .$$

Suppose that you know that $S_Y(f)$ can be factored into the form
$|G(f)|^2$, where $G(f)$ is a stable causal filter with a stable causal
inverse. What is the best choice of $H(f)$ in the sense of maximizing the
signal to noise ratio? What is the best causal $H(f)$?

18. Show that equation (5.2) converges in mean square if the filter is stable 
and the input process has finitely bounded mean and variance. Show 
that convergence with probability one is achieved if the convergence 
of equation (A. 30) is fast enough for the pulse response. 

19. Show that the sum of equation (5.7) converges for the two-weakly 
stationary case if the filter is stable and the input process has finitely 
bounded variance. 

20. Provide a formal argument for the integration counterpart of equation
(5.51); that is, if $\{X(t)\}$ is a stationary two-sided continuous time
random process and $Y(t) = \int_{-\infty}^{t} X(s)\, ds$, then, subject to suitable

technical conditions, Sy{f) = Sx{f)/ P- 

21. Prove that equation (5.58) holds under the conditions given. 

22. Suppose that $\{Y_n\}$ is as in example [5.1] and that $W_n = Y_n + U_n$,
where $U_n$ is a zero-mean white noise process with second moment
$E(U_n^2) = N_0/2$. Solve the Wiener-Hopf equation to obtain a LLSE
of $Y_{n+m}$ given $\{W_i; i < n\}$ for $m > n$. Evaluate the resulting mean
squared error.

23. Prove the claim that if $\{X_n\}$ and $\{Y_n\}$ are described by equations
(5.59) and (5.60) and if $\hat{X}_n$ is a LLSE estimate of $X_n$ given $Y_0, Y_1, \ldots, Y_{n-1}$,
then $\hat{Y}_n = H_n \hat{X}_n$ is a LLSE estimate of $Y_n$ given the same
observations.

24. Prove the claim that the innovations sequence {t'„} of example [5.7] is 
uncorrelated and has zero mean. (Fill in the details of the arguments 
used in the text.) 

25. Let $\{Y_n\}$ be as in example [5.2]. Find the LLSE for $Y_{n+m}$ given
$\{Y_0, Y_1, \ldots, Y_n\}$ for an arbitrary positive integer $m$. Evaluate the
mean square error. Repeat for the process of example [5.3] (the same
process with $r = 1$).

26. Specialize the recursive estimator formulas of equations (5.67) through 
(5.70) to the case where {Xn} is the {Yn} process of example [5.2], 
where i7„ is a constant, say a, and where 'En — Nq/2, all n. Describe 
the behavior of the estimator as $n \to \infty$.
27. Find an expression for the mean square error in example [5.4].
Specialize to infinite smoothing.

28. In the section on linear estimation we assumed that all processes had
zero-mean functions. In this problem we remove this assumption. Let
$\{X_n\}$ and $\{Y_n\}$ be random processes with mean functions $\{m_X(n)\}$
and $\{m_Y(n)\}$, respectively. We estimate $X_n$ by adding a constant to
equation (5.52); i.e.,

$$\hat{X}_n = a_n + \sum_{k:\, n-k \in \mathcal{K}} h_k Y_{n-k} .$$

(a) Show that the minimum mean square estimate of $X_n$ is $\hat{X}_n =
m_X(n)$ if no observations are used.

(b) Modify and prove theorem 5.1 to allow for the nonzero means.

29. Suppose that $\{X_n\}$ and $\{Z_n\}$ are zero mean, mutually independent,
iid, two-sided Gaussian random processes with correlations

$$R_X(k) = \sigma_X^2 \delta_k ; \quad R_Z(k) = \sigma_Z^2 \delta_k .$$

These processes are used to construct new processes as follows:

$$Y_n = Z_n + rY_{n-1}$$
$$U_n = X_n + Z_n$$
$$W_n = U_n + rU_{n-1}$$

Find the covariance and power spectral densities of $\{U_n\}$ and $\{W_n\}$.
Find $E[(X_n - W_n)^2]$.

30. Suppose that $\{Z_n\}$ and $\{W_n\}$ are two mutually independent two-sided
zero mean iid Gaussian processes with variances $\sigma_Z^2$ and $\sigma_W^2$,
respectively. $Z_n$ is put into a linear time-invariant filter to form an
output process $\{X_n\}$ defined by

$$X_n = Z_n - rZ_{n-1} ,$$

where $0 < r < 1$. (Such a filter is sometimes called a preemphasis
filter in speech processing.) This process is then used to form a new
process

$$Y_n = X_n + W_n ,$$

which can be viewed as a noisy version of the preemphasized $Z_n$
process. Lastly, the $Y_n$ process is put through a "deemphasis filter"
to form an output process $U_n$ defined by

$$U_n = rU_{n-1} + Y_n .$$
(a) Find the autocorrelation $R_Z$ and the power spectral density $S_Z$.
Recall that for a weakly stationary discrete time process with
zero mean $R_Z(k) = E(Z_n Z_{n+k})$ and

$$S_Z(f) = \sum_{k=-\infty}^{\infty} R_Z(k) e^{-j2\pi f k} ,$$

the discrete time Fourier transform of $R_Z$.

(b) Find the autocorrelation $R_X$ and the power spectral density $S_X$.

(c) Find the autocorrelation $R_Y$ and the power spectral density $S_Y$.

(d) Find the overall mean squared error $E[(U_n - Z_n)^2]$.

31. Suppose that $\{X_n; n \in \mathcal{Z}\}$ is a discrete time iid Gaussian random
process with 0 mean and variance $\sigma_X^2 = E[X_0^2]$. We consider this
an input signal to a signal processing system. Suppose also that
$\{W_n; n \in \mathcal{Z}\}$ is a discrete time iid Gaussian random process with
0 mean and variance $\sigma_W^2$ and that the two processes are mutually
independent. $W_n$ is considered to be noise. Suppose that $X_n$ is put
into a linear filter with unit pulse response $h$, where

$$h_k = \begin{cases} 1 & k = 0 \\ -1 & k = -1 \\ 0 & \text{otherwise} \end{cases}$$

to form an output $U = X * h$, the convolution of the input signal and
the unit pulse response. The final output signal is then formed by
adding the noise to the filtered input signal, $Y_n = U_n + W_n$.

(a) Find the mean, power spectral density, and marginal pdf for $U_n$.
(b) Find the mean, covariance, and power spectral density for $Y_n$.
(c) Find $E[Y_n X_n]$.

(d) Does the mean ergodic theorem hold for {Wi}? 

32. Suppose that $\{X(t); t \in \mathcal{R}\}$ is a weakly stationary continuous time
Gaussian random process with 0 mean and autocorrelation function

$$R_X(\tau) = E[X(t)X(t + \tau)] = \sigma_X^2 e^{-|\tau|} .$$

(a) Define the random process $\{Y(t); t \in \mathcal{R}\}$ by

$$Y(t) = \int_{t-T}^{t} X(\alpha)\, d\alpha ,$$

where $T > 0$ is a fixed parameter. (This is a short term integrator.)
Find the mean and power spectral density of $\{Y(t)\}$.

(b) For fixed $t > s$, find the characteristic function and the pdf for
the random variable $X(t) - X(s)$.

(c) Consider the following nonlinear modulation scheme: Define

$$W(t) = e^{j(2\pi f_0 t + cX(t) + \Theta)} ,$$

where $f_0$ is a fixed frequency, $\Theta$ is a uniform random variable on
$[0, 2\pi]$, $\Theta$ is independent of all of the $X(t)$, and $c$ is a modulation
constant. (This is a mathematical model for phase modulation.)
(Define the expectation of a complex random variable in the natural
way, that is, if $Z = \Re(Z) + j\Im(Z)$, then $E(Z) = E[\Re(Z)] +
jE[\Im(Z)]$.) Define the autocorrelation of a complex valued random
process $W(t)$ by

$$R_W(t,s) = E(W(t)W(s)^*) ,$$

where $W(s)^*$ denotes the complex conjugate of $W(s)$.

Find the mean $E(W(t))$ and the autocorrelation function $R_W(t,s) =
E[W(t)W(s)^*]$.

Hint: The autocorrelation is admittedly a trick question (but
a very useful trick). Keep part (b) in mind and think about
characteristic functions.



33. A random variable $X$ is described by a pmf

$$p_X(k) = \begin{cases} ca^k & k = 0, 1, \ldots \\ 0 & \text{else} \end{cases} \qquad (5.71)$$

where $0 < a < 1$. A random variable $Z$ is described by a pmf

$$p_Z(k) = \frac{1}{2} , \quad k = \pm 1 . \qquad (5.72)$$

(a) Find the mean, variance and characteristic function of $Z$.

(b) Evaluate $c$ and find the mean, variance, and characteristic
function of $X$.

(c) Now suppose that $\{X_n\}$ and $\{Z_n\}$ are two mutually independent
iid random processes with marginal pmf's $p_X$ of (5.71) and $p_Z$
of (5.72), respectively. Form a new random process $Y_n$ defined
by

$$Y_n = X_n Z_n \quad \text{all } n . \qquad (5.73)$$

Find the mean and covariance function for $Y_n$. Is $Y_n$ weakly
stationary? If so, find its power spectral density.
(d) Find the marginal pmf $p_{Y_n}$.

(e) Find the probability $\Pr(X_n > 2X_{n-1})$.

(f) Find the conditional expectations $E[Y_n|Z_n]$ and $E[X_n|Y_n]$.

(g) Find the probability $\Pr(Y_n > 2Y_{n-1})$.

34. Suppose that {X„,} is an iid random process with marginal pdf 

f (x) — I a: > 0 

^ I 0 otherwise 



Let $N$ be a fixed positive integer.



(a) What is the probability that at least one of the samples $X_0, \ldots, X_{N-1}$
exceeds a fixed positive value $\gamma$?

(b) What is the probability that all of the samples $X_0, \ldots, X_{N-1}$
exceed a fixed positive value $\gamma$?

(c) Define a new process $U_n = ZX_n$, where $Z$ is a binary random
variable with the marginal pmf of equation (5.72) and $Z$
is independent of all the $X_n$. Find the mean $\overline{U}$ and covariance
$K_U$ of $U_n$.

Is $U_n$ weakly stationary? Is it iid?

(d) Does the sample average

$$\frac{1}{n} \sum_{k=0}^{n-1} U_k$$

converge in probability? If yes, to what?

(e) Find a simple nontrivial numerical upper bound to the
probability

$$\Pr(|U_n - \overline{U}| > 10\sigma_U) ,$$

where $\sigma_U^2$ is the variance of $U_0$.



35. Suppose that {X^} is a weakly stationary random process with zero 
mean and autocorrelation Rx{k) = for all integer k, here |o:| < 

1. A new random process {Yn} is defined by the relation = A„ -|- 



(a) Find the autocorrelation function RY{k) and the average power 
(b) For what value of $\beta$ is $\{Y_n\}$ a white noise process, i.e., for what
value of $\beta$ is $S_Y(f)$ a constant? This is an example of
a whitening filter.

(c) Suppose that $\beta$ is chosen as in the previous part so that $\{Y_n\}$ is
white. (You do not need the actual value of $\beta$ for this part; you
can leave things in terms of $\beta$ if you did not do the previous
part.) Assume also that $\{Y_n\}$ is a Gaussian random process.
Find the variance and the pdf of the random variable

$$\sum_{i=0}^{N-1} Y_i ,$$

where $N$ is a fixed positive integer.

36. Suppose that $\{X_n; n \in \mathcal{Z}\}$ is a Bernoulli random process with
parameter $p$, i.e., it is an iid binary process with $p_X(1) = 1 - p_X(0) = p$.
Suppose that $Z$ is a binary random variable with the pmf of equation
(5.72) and that $Z$ and the $X_n$ are independent of each other. Define
for integers $n > k > 0$ the random variables

$$W_{k,n} = \sum_{i=k+1}^{n} X_i .$$

Define a one-sided random process $\{Y_n; n = 0, 1, \ldots\}$ as follows:

$$Y_n = \begin{cases} Z & n = 0 \\ Y_{n-1}(-1)^{X_n} & n = 1, 2, \ldots \end{cases}$$

Note that for any $n > k > 0$,

$$Y_n = Y_k (-1)^{W_{k,n}} . \qquad (5.74)$$

(a) Find the mean $m_Y = E[Y_n]$. Show that $p_{Y_n}(1)$ can be expressed
as a very simple function of $m_Y$ and use this fact to evaluate
$p_{Y_n}(y)$ for any nonnegative integer $n$.

(b) Find the mean, variance, and characteristic function of $W_{k,n}$.

(c) If you fix a positive integer $k$, do the random variables

$$\frac{W_{k,n}}{n - k}$$

converge in mean square as $n \to \infty$? If so, to what?
(d) Write an expression for the conditional pmf $p_{Y_n|Y_k}$ for $n >
k > 0$ in terms of the random variable $W_{k,n}$. Evaluate this
probability.

Hint: Half credit for this part will be given if you get the general
expression, i.e., a sum with correct limits and summand, correct.
The actual evaluation is a bit tricky, so do not waste time on it
if you do not see the trick.

(e) Find the covariance function $K_Y(k,n)$ of $\{Y_n\}$.

Hint: One way (not the only way) to do this part is to consider
the case $n > k > 0$, use equation (5.74) and the fact

$$-1 = e^{j\pi} \qquad (5.75)$$

and try to make your formula look like the characteristic function
for $W_{k,n}$.

37. (Problem courtesy of the ECE Department of the Technion.) Consider
a process $\{Y_t; t \in \Re\}$ that can take on only the values $\{-1, +1\}$
and suppose that

$$p_{Y_t}(+1) = p_{Y_t}(-1) = 0.5$$

for all $t$. Suppose also that for $\tau > 0$



py,+^\y,{M - 1) = Pn+dn(-i| + 1) 




T < T 
r > T 



(a) Find the autocorrelation function $R_Y$ of the process $\{Y_t; t \in \Re\}$.

(b) Find the power spectral density $S_Y(f)$.

38. (Problem courtesy of the ECE Department of the Technion.) A
known deterministic signal $\{s(t); t \in \Re\}$ is transmitted over a noisy
channel and the received signal is $\{X(t); t \in \Re\}$, where $X(t) =
As(t) + W(t)$, where $\{W(t); t \in \Re\}$ is a Gaussian white noise process
with power spectral density $S_W(f) = N_0/2$; $f \in \Re$ and $A$ is a
random variable independent of $W(t)$ for all $t$. The receiver, which
is assumed to know the transmitted signal, computes the statistic
Yt = X{t) dt. 

(a) Find the conditional pdf $f_{Y_T|A}(y|a)$.

(b) Assuming that $A$ is $\mathcal{N}(0, \sigma_A^2)$, find the MMSE estimate of $A$
given $Y_T$.

(c) Find the MMSE resulting in the previous part. 







39. (Problem courtesy of the ECE Department of the Technion.) Suppose
that $\{Y(t); t \in \Re\}$ is a weakly stationary random process with 0 mean
and autocorrelation function $R_Y(\tau)$ and that $A$ is a random variable
that is independent of $Y(t)$ for all $t$. Define the random process
$\{X(t); t \in \Re\}$ by $X(t) = A + Y(t)$. Consider the estimator for $A$
defined by

$$\hat{A} = \frac{1}{T} \int_0^T X(t)\, dt . \qquad (5.76)$$

(a) Show that $E(\hat{A}) = E(A)$.

(b) Show that the mean squared error is given by

$$E[(\hat{A} - A)^2] = \left(\frac{1}{T}\right)^2 \int_0^T \int_0^T R_Y(t - s)\, dt\, ds
= \frac{2}{T} \int_0^T \left(1 - \frac{\tau}{T}\right) R_Y(\tau)\, d\tau .$$

40. (Problem courtesy of the ECE Department of the Technion.) Let 
X{t) = S{t) + N{t) where S{t) is a deterministic signal that is 0 
outside the interval [— T, 0] and N{t) is white noise with zero mean 
and power spectral density No/2. The random process X{t) is passed 
through a linear filter with impulse response h{t) = S{—t), a time- 
reversed version of the signal. Let Y (t) denote the filter output pro- 
cess. 

(a) Find E[Y{t)]. 

(b) Find the covariance $K_Y(t, t + \tau)$.

(c) Express the covariance function in terms of the mean function. 
A Menagerie of Processes



The basic tools for describing and analyzing random processes have all
been developed in the preceding chapters along with a variety of examples
of random processes with and without memory. The goal of this chapter 
is to use these tools to describe a menagerie of useful random processes, 
usually by taking a simple random process and applying some form of signal 
processing such as linear filtering in order to produce a more complicated 
random process. In chapter 5 the effect of linear filtering on second order 
moments was considered, but in this chapter we look in more detail at the 
resulting output process and we consider other forms of signal processing 
as well. In the course of the development a few new tools and several 
variations on old tools for deriving distributions are introduced. Much of 
this chapter can be considered as practice of the methods developed in the 
previous chapters, with names often being given to the specific examples 
developed. In fact several processes with memory have been encountered 
previously: the Binomial counting process and the discrete time Wiener 
process, in particular. The goal now is to extend the techniques used in 
these special cases to more general situations and to introduce a wider 
variety of processes. 

The development of examples begins with a continuation of the study 
of the output processes of linear systems with random process inputs. The 
goal is to develop the detailed structure of such processes and of other
processes with similar behavior that cannot be described by a linear system
model. In chapter 5, we confined interest to second-order properties of the 
output random process, properties that can be found under quite general 
assumptions on the input process and filter. In order to get more detailed 
probabilistic descriptions of the output process, we next further restrict the 
input process for the discrete time case to be an iid random process and 
study the resulting output process and the continuous time analog to such
a process. By restricting the structure of the output process in this man- 
ner, we shall see that in some cases we can find complete descriptions of 
the process and not just the first and second moments. The random pro- 
cesses obtained in this way provide many important and useful models that 
are frequently encountered in the signal processing literature, including 
moving-average, autoregressive, autoregressive moving-average (ARMA), 
independent increment, counting, random walk, Markov, Wiener, Poisson, 
and Gaussian processes. Similar techniques are used for the development 
of a variety of random processes with markedly different behavior, the key 
tools being characteristic functions and conditional probability distribu- 
tions. This chapter contains extensive practice in derived distributions and 
in specifying random processes. 



6.1 Discrete Time Linear Models 

Many complicated random processes are well modeled as a linear operation 
on a simple process. For example, a complicated process with memory 
might be constructed by passing a simple iid process through a linear filter. 
In this section we define some general linear models that will be explored 
in some detail in the rest of the chapter. 

Recall that if we have a random process $\{X_n; n \in \mathcal{T}\}$ as input to a
linear system described by a convolution, then as in equation (5.2) there is
a pulse response $h_k$ such that the output process $\{Y_n\}$ is given by

$$Y_n = \sum_{k:\, n-k \in \mathcal{T}} X_{n-k} h_k . \qquad (6.1)$$

A linear filter with such a description, that is, one that can be defined
as a convolution, is sometimes called a moving-average filter since the
output is a weighted running average of the inputs. If only a finite number
of the $h_k$ are not zero, then the filter is called a finite-order moving-average
filter (or an FIR filter, for "finite impulse response," in contrast to an IIR
or "infinite impulse response" filter). The order of the filter is equal to the
maximum minus the minimum value of $k$ for which the $h_k$ are nonzero.
For example, if $Y_n = X_n + X_{n-1}$, we have a first-order moving-average
filter. Although some authors reserve the term moving-average filter for a
finite-order filter, we will use the broader definition we have given. A block
diagram for such a filter is given in Figure 6.1.
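
As a small illustration of the convolution (6.1), the following Python sketch applies a finite-order moving-average filter to a short deterministic input. It assumes a one-sided index set starting at 0, so that terms with $n - k < 0$ are simply omitted; the function name `moving_average` and the sample input are our own choices for illustration and do not appear in the text.

```python
def moving_average(x, h):
    """Compute Y_n = sum_k h[k] * x[n-k] for a causal FIR pulse response h.

    Assumes a one-sided input x[0], x[1], ...; terms with n-k < 0 are skipped.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


# First-order moving-average example from the text: Y_n = X_n + X_{n-1},
# i.e. h_0 = h_1 = 1 and all other h_k = 0.
x = [1.0, 2.0, 3.0, 4.0]
y = moving_average(x, [1.0, 1.0])
```

Each output sample depends only on current and past inputs, never on other outputs, which is the feedforward structure the text describes.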

Figure 6.1: Moving average filter

Several other names are used to describe finite-order moving-average
filters. Since the output is determined by the inputs without any feedback
from past or future outputs, the filter is sometimes called a feedforward or
tapped delay line or transversal filter. If the filter has a well-defined transfer
function $H(f)$ (e.g., it is stable) and if the transfer function is analytically
continued to the complex plane by making the substitution $z = e^{j2\pi f}$, then
the resulting complex function contains only zeroes and no poles on the unit
circle in the complex plane. For this reason such a filter is sometimes called
an "all-zeroes" filter. This nomenclature really only applies to the Fourier
transform or the z-transform confined to the unit circle. If one considers
arbitrary $z$, then the filter can have poles at $z = 0$.
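
To make the "all-zeroes" description concrete, the sketch below evaluates the transfer function of a finite-order moving-average filter on the unit circle, using the convention $H(f) = \sum_k h_k e^{-j2\pi f k}$ that matches the transfer functions defined in the chapter 5 problems. The helper name `transfer_function` is hypothetical. For the first-order example $h_0 = h_1 = 1$ the transfer function is $1 + e^{-j2\pi f}$, which vanishes at $f = 1/2$.

```python
import cmath


def transfer_function(h, f):
    """Evaluate H(f) = sum_k h[k] * exp(-j 2 pi f k) for a causal FIR filter."""
    return sum(hk * cmath.exp(-2j * cmath.pi * f * k) for k, hk in enumerate(h))


# First-order moving-average filter Y_n = X_n + X_{n-1}.
H_zero = transfer_function([1.0, 1.0], 0.0)   # value at f = 0
H_half = transfer_function([1.0, 1.0], 0.5)   # a zero on the unit circle
```

Up to floating-point rounding, `H_half` is zero while `H_zero` equals 2, so this filter passes low frequencies and nulls the frequency $f = 1/2$.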

In chapter 5 we considered only linear systems involving moving-average 
filters, that is, systems that could be represented as a convolution. This was 
because the convolution representation is well suited to second-order I/O 
relations. In this chapter, however, we will find that other representations 
are often more useful. Recall that a convolution is simply one example of a 
difference equation. Another form of difference equation describing a linear 
system is obtained by convolving the outputs to get the inputs instead 
of vice versa. For example, the output process may satisfy a difference 
equation of the form 



$$X_n = \sum_k a_k Y_{n-k} . \qquad (6.2)$$

For convenience it is usually assumed that $a_0 = 1$ and $a_k = 0$ for negative
$k$ and hence that the equation can be expressed as

$$Y_n = X_n - \sum_{k=1,2,\ldots} a_k Y_{n-k} . \qquad (6.3)$$

As in the moving-average case, the limits of the sum depend on the index
set; e.g., the sum could be from $k = -\infty$ to $\infty$ in the two-sided case with
$\mathcal{T} = \mathcal{Z}$ or from $k = -\infty$ to $n$ in the one-sided case with $\mathcal{T} = \mathcal{Z}_+$.

The numbers $\{a_k\}$ are called regression coefficients, and the corresponding
filter is called an autoregressive filter. If $a_k \ne 0$ for only a finite number
of $k$, the filter is said to be finite-order autoregressive. The order is equal
to the maximum minus the minimum value of $k$ for which $a_k$ is nonzero.
For example, if $X_n = Y_n + Y_{n-1}$, we have a first-order autoregressive filter. As
with the moving-average filters, for some authors the "finite" is implicit,
but we will use the more general definition. A block diagram for such a
filter is given in Figure 6.2.




Figure 6.2: Autoregressive filter 

Note that, in contrast with a finite-order moving-average filter, a finite- 
order autoregressive filter contains only feedback terms and no feedforward 
terms — the new output can be found solely from the current input and 
past or future outputs. Hence it is sometimes called a feedback filter. If we
consider a deterministic input and transform both sides of (6.2), then we 
find that the transfer function of an autoregressive filter has the form 

$$ H(f) = \frac{1}{\sum_k a_k e^{-j2\pi fk}} , $$

where we continue to assume that $a_0 = 1$. Note that the analytic continuation
of the transfer function into the complex plane with the substitution
$z = e^{j2\pi f}$ for a finite-order autoregressive filter has poles but no zeroes on
the unit circle in the complex plane. Hence a finite-order autoregressive
filter is sometimes called an all-poles filter. An autoregressive filter may or
may not be stable, depending on the location of the poles.
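
The feedback recursion (6.3) and the role of the pole location can be illustrated with a short sketch. It assumes a causal, finite-order filter started from zero initial conditions; the function name `ar_filter` and the two coefficient choices are illustrative only and do not come from the text.

```python
def ar_filter(x, a):
    """Causal finite-order autoregressive filter with zero initial conditions.

    Implements Y_n = X_n - sum_{k=1}^{p} a[k] * Y_{n-k}, where a[0] is assumed to be 1.
    """
    y = []
    for n, xn in enumerate(x):
        # Feedback uses only outputs that have already been computed.
        feedback = sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(xn - feedback)
    return y


impulse = [1.0] + [0.0] * 7

# a_1 = -0.5 puts the single pole at z = 0.5, inside the unit circle.
stable = ar_filter(impulse, [1.0, -0.5])

# a_1 = -1 puts the pole at z = 1, on the unit circle (the discrete time integrator).
integrator = ar_filter(impulse, [1.0, -1.0])

print(stable)      # geometrically decaying response
print(integrator)  # response that never decays
```

With the pole inside the unit circle the response to a unit pulse decays, whereas with the pole on the unit circle it does not, which is the behavior one would expect of an unstable filter.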

More generally, one can describe a linear system by a general difference 
equation combining the two forms — moving-average and autoregressive — 
as in (A.34):

$$ \sum_k a_k y_{n-k} = \sum_i b_i x_{n-i} . $$

Filters with this description are called ARMA (for “autoregressive moving-average”)
filters. ARMA filters are said to be finite-order if only a finite
number of the $a_k$'s and $b_k$'s are not zero. A finite-order ARMA filter is
depicted in figure 6.3.

Figure 6.3: Moving average filter



Once again, it should be noted that some authors use finite-order im- 
plicitly, a convention that we will not adopt. Applying a deterministic input 
and using (A.32), we find that the transfer function of an ARMA filter has
the form

$$ H(f) = \frac{\sum_i b_i e^{-j2\pi fi}}{\sum_k a_k e^{-j2\pi fk}} , \tag{6.4} $$

where we continue to assume that $a_0 = 1$.

As we shall see by example, one can often describe a linear system by
any of these filters, and hence one often chooses the simplest model for
the desired application. For example, an ARMA filter representation with
only three nonzero $a_k$ and two nonzero $b_k$ would be simpler than either a
pure autoregressive or pure moving-average representation, which would in 
general require an infinite number of parameters. The general development 
of representations of one type of filter or process from another is an area of 
complex analysis that is outside the scope of this book. We shall, however, 
see some simple examples where different representations are easily found. 

We are now ready to introduce three classes of random processes that 
are collectively called linear models since they are formed by putting an iid 
process into a linear system. 

A discrete time random process $\{Y_n\}$ is called an autoregressive random
process if it is formed by putting an iid random process into an autoregressive
filter. Similarly, the process is said to be a moving-average random
process or ARMA random process if it is formed by putting an iid process
into a moving-average or ARMA filter, respectively. If a finite-order filter 
is used, the order of the process is the same as the order of the filter. 

Since iid processes are uncorrelated, the techniques of chapter 5 imme- 
diately yield the power spectral densities of these processes in the two-sided 
weakly stationary case and yield in general the second-order moments of 
moving-average processes. In fact, some books and papers which deal only 
with second order moment properties define an autoregressive (moving av- 
erage, ARMA) process more generally as the output of an autoregressive 
(moving average, ARMA) filter with a weakly stationary uncorrelated in- 
put. We use the stricter definition in order to derive actual distributions 
in addition to second order moments. We shall see that we can easily 
find marginal probability distributions for moving-average processes. Per- 
haps surprisingly, however, the autoregressive models will prove much more 
useful for finding more complete specifications, that is, joint probability
distributions for the output process. The basic ideas are most easily
demonstrated in the simple and familiar setting of example [5.3], summing
successive outputs of an iid process.



6.2 Sums of IID Random Variables 

We begin by recalling a simple but important example from chapters 3 
and 5: examples [3.35], [3.37], and [5.3]. These examples can be used to 
exemplify both autoregressive and moving average filters. Let
$\{X_n; n = 1, 2, \ldots\}$ be an iid process with mean $m$ and variance $\sigma_X^2$
(with discrete or continuous alphabet). Consider a linear filter with
Kronecker delta response

$$ h_k = \begin{cases} 1 & k = 0, 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases} $$

This is the discrete time integrator and it is not stable. The output process 
is then given as the sum of iid random variables:

$$ Y_n = \begin{cases} 0 & n = 0 \\ \sum_{i=1}^{n} X_i & n = 1, 2, \ldots \end{cases} \tag{6.6} $$

The two best known members of this class are the Binomial counting process 
and the Wiener process (or discrete-time diffusion process), which were 
encountered in chapter 2.

We have changed notation slightly from example [5.3] since here we
force $Y_0 = 0$. Observe that if we further let $X_0 = 0$, then
$\{Y_n; n \in \mathcal{Z}_+\}$ is by construction a moving-average random process with
the moving-average filter $h_k = 1$ for all nonnegative $k$.

Since an iid input process is also uncorrelated, we can apply example 
[5.3] (with a slight change due to the different indexing) and evaluate the 
first and second moments of the Y process as 

$$ EY_n = mn ; \quad n = 1, 2, \ldots $$

$$ K_Y(k, j) = \sigma_X^2 \min(k, j) ; \quad k, j = 1, 2, \ldots . $$

For later use we state these results in a slightly different notation: Since
$EY_1 = m$, since $K_Y(1, 1) = \sigma_X^2 = \sigma_{Y_1}^2$, and since the formulas also hold for
$n = 0$ and for $k = j = 0$, we have that

$$ EY_t = tEY_1 ; \quad t \ge 0 \tag{6.7} $$

$$ K_Y(t, s) = \sigma_{Y_1}^2 \min(t, s) ; \quad t, s \ge 0 . \tag{6.8} $$
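
The covariance formula can be checked by direct bookkeeping rather than by simulation. The sketch below assumes only that the inputs are uncorrelated with a common variance, so that the covariance of two sums reduces to counting the terms they share; the function name and the numerical variance are illustrative.

```python
def sum_process_covariance(k, j, var_x):
    """Covariance of Y_k and Y_j, where Y_n = X_1 + ... + X_n and the X_i are
    uncorrelated with common variance var_x."""
    # Cov(X_i, X_l) equals var_x when i == l and 0 otherwise,
    # so only the terms shared by both sums contribute.
    return sum(var_x
               for i in range(1, k + 1)
               for l in range(1, j + 1)
               if i == l)


var_x = 2.0
table = [[sum_process_covariance(k, j, var_x) for j in range(1, 5)]
         for k in range(1, 5)]

for row in table:
    print(row)
```

Each entry of the table equals the input variance times the smaller of the two time indices, which is the min form of the covariance function.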



We explicitly consider only those values of t and s that are in the appropri- 
ate index set, here the nonnegative integers. An alternative representation 
to the linear system representation defined by (3.138) is obtained by rewrit- 
ing the sum as a linear difference equation with initial conditions: 



$$ Y_n = \begin{cases} 0 & n = 0 \\ Y_{n-1} + X_n & n = 1, 2, 3, \ldots \end{cases} $$

Observe that in this guise, $\{Y_n\}$ is a first-order autoregressive process (see
(6.2)) since it is obtained by passing the iid $X$ process through a first-order
autoregressive filter with $a_0 = 1$ and $a_1 = -1$. Observe again that this filter
is not stable, but it does have a transfer function, $H(f) = 1/(1 - e^{-j2\pi f})$
(which, with the substitution $z = e^{j2\pi f}$, has a pole on the unit circle).
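
As a small illustration that the two representations describe the same process, the following sketch computes the output both ways for one input sequence. The function names and the sample path are hypothetical; the list index 0 is taken to hold the first input sample.

```python
def moving_average_form(x):
    """Y_n as a convolution with h_k = 1 for all nonnegative k (x[0] is X_1)."""
    return [sum(x[: n + 1]) for n in range(len(x))]


def autoregressive_form(x):
    """Y_n = Y_{n-1} + X_n with the initial condition Y_0 = 0."""
    y, y_prev = [], 0
    for xn in x:
        y_prev = y_prev + xn
        y.append(y_prev)
    return y


x = [1, 0, 1, 1, 0, 1]  # one hypothetical sample path of a binary iid input
ma_output = moving_average_form(x)
ar_output = autoregressive_form(x)
print(ma_output, ar_output)
```

The moving-average form needs a sum that grows with the time index, while the autoregressive form needs only the previous output, which is one reason the latter is convenient for conditional distributions.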

We have seen from section 3.12 how to find the marginal distributions 
for such sums of iid processes and we have seen from sections 3.7-3.7.2 how
to find the conditional distributions and hence a complete specification. 
The natural question at this point is how general the methods and re- 
sults referred to are. Toward this end we consider generalizations in several 
directions. First we consider a direct generalization to continuous time pro- 
cesses, the class of processes with independent and stationary increments. 
We next consider partial generalizations to discrete time moving average 
and autoregressive processes. 



6.3 Independent Stationary Increments 

We now generalize the class of processes formed by summing iid random 
variables in a way that works for both continuous and discrete time. The 
generalization is accomplished by focusing on the changes in a process 
rather than on the values of the process. The general class, that of processes 
with independent and stationary increments, reduces in the discrete time 
case to the class considered in the previous sections: processes formed by 
summing outputs of an iid process. 

The change in value of a random process in moving forward in any 
given time interval is called a jump or increment of the process. The spe- 
cific class of processes that we now consider consists of random processes 
whose jumps or increments in nonoverlapping time intervals are indepen- 
dent random variables whose probability distributions depend only on the 
time differences over which the jumps occur. In the discrete time case, 
the nth output of such processes can be regarded as the sum of the first n 
random variables produced by an iid random process. Because the jumps 
in nonoverlapping time intervals then consist of sums of different iid ran- 
dom variables, the jumps are obviously independent. This general class 
of processes is of interest for three reasons: First, the class contains two 
of the most important examples of random processes: the Wiener pro- 
cess and the Poisson counting process. Second, members of the class form 
building blocks for many other random process models. For example, in 
chapter 5 we presented an intuitive derivation of the properties of continu- 
ous time Gaussian white noise. A rigorous development would be based on 
the Wiener process, which we can treat rigorously with elementary tools.
Third, these processes provide a useful vehicle for practice with several
important and useful tools of probability theory: characteristic functions, 
conditional pmf’s, conditional pdf’s, and nonelementary conditional prob- 
ability. In addition, independent increment processes provide specific ex- 
amples of several general classes of processes: Markov processes, counting 
processes, and random walks. 

Independent and stationary increment processes are generally not them- 
selves weakly stationary since, as has already been seen in the discrete time 
case, their probabilistic description changes with time. They possess, how- 
ever, some stationarity properties. In particular the distributions of the 
jumps or increments taken over fixed-length time intervals are stationary 
even though the distributions of the process are not.

The increments or jumps or differences of a random process are obtained 
by picking a collection of ordered sample times and forming the pairwise 
differences of the samples of the process taken at these times. For example,
given a discrete time or continuous time random process $\{Y_t; t \in \mathcal{T}\}$, one
can choose a collection of sample times $t_0, t_1, \ldots, t_k$, $t_i \in \mathcal{T}$ all $i$, where we
assume that the sample times are ordered in the sense that

$$ t_0 < t_1 < t_2 < \ldots < t_k . $$

Given this collection of sample times, the corresponding increments of the
process $\{Y_t\}$ are the differences

$$ Y_{t_i} - Y_{t_{i-1}} ; \quad i = 1, 2, \ldots, k . $$

Note that the increments very much depend on the choice of the sample 
times; one would expect quite different behavior when the samples are 
widely separated than when they are nearby. We can now define the general 
class of processes with independent increments for both the discrete and 
continuous time cases. 

A random process $\{Y_t; t \in \mathcal{T}\}$ is said to have independent increments
or to be an independent increment random process if for all choices of $k$ and
sample times $\{t_i; i = 0, 1, \ldots, k\}$, the increments $Y_{t_i} - Y_{t_{i-1}}$; $i = 1, 2, \ldots, k$
are independent random variables. An independent increment process is
said to have stationary increments if the distribution of the increment
$Y_{t+\delta} - Y_{s+\delta}$ does not depend on $\delta$ for all allowed values of $t > s$ and $\delta$.
(Observe that this is really only a first-order stationarity requirement on 
the increments, not by definition a strict stationarity requirement, but the 
language is standard. In any case, if the increments are independent and 
stationary in this sense, then they are also strictly stationary.) 

We shall call a random process an independent stationary increment or 
isi process if it has independent and stationary increments.

We shall always make the additional assumption that 0 is the smallest
possible time index; that is, that $t \ge 0$ for all $t \in \mathcal{T}$, and that $Y_0 = 0$ as in
the discrete time case. We shall see that such processes are not stationary
and that they must “start” somewhere or, equivalently, be one-sided random
processes. We simply define the starting time as 0 for convenience and
fix the starting value of the random process as 0, again for convenience. 
If these initial conditions are changed, the following development changes 
only in notational details. 

A discrete time random process is an isi process if and only if it can 
be represented as a sum of iid random variables, i.e., if it has the form 
considered in the preceding sections. To see this, observe that if $\{Y_n\}$
has independent and stationary increments, then by choosing sample times
$t_i = i$ and defining $X_n = Y_n - Y_{n-1}$ for $n = 1, 2, \ldots$, the $X_n$ must be
independent from the independent increment assumption, and they must
be identically distributed from the stationary increment assumption. Thus
we have that

$$ Y_n = \sum_{k=1}^{n} (Y_k - Y_{k-1}) = \sum_{k=1}^{n} X_k $$

and hence $Y_n$ has the form of (6.6). Conversely, if $Y_n$ is the sum of iid
random variables, then increments will always have the form

$$ Y_t - Y_s = \sum_{i=s+1}^{t} X_i ; \quad t > s , $$



that is, the form of sums of disjoint collections of iid random variables, and 
hence they will be independent. Furthermore, the increments will clearly 
be stationary since they are sums of iid random variables; in particular, 
the distribution of the increment will depend only on the number of sam- 
ples added and not on the starting time. Thus all of the development for 
sums of iid processes could have been entitled “discrete time processes with 
independent and stationary increments.” 
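
The stationarity of the increments can be made concrete by computing the pmf of an increment exactly. The sketch below assumes a nonnegative-integer-valued iid input described by a list of probabilities; the helper names are illustrative, and the fair-coin input is only an example.

```python
def convolve(p, q):
    """pmf of the sum of two independent nonnegative integer random variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def increment_pmf(s, t, pmf_x):
    """pmf of Y_t - Y_s = X_{s+1} + ... + X_t for iid X with pmf pmf_x."""
    pmf = [1.0]  # pmf of the empty sum, which is 0 with probability 1
    for _ in range(s + 1, t + 1):
        pmf = convolve(pmf, pmf_x)
    return pmf


fair_coin = [0.5, 0.5]  # Pr(X = 0), Pr(X = 1)
early = increment_pmf(0, 3, fair_coin)
late = increment_pmf(4, 7, fair_coin)
print(early, late)
```

Both increments span three samples, so they have the same (binomial) pmf even though they start at different times.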

Unfortunately, there is no such nice construction of continuous time in- 
dependent increment processes. The natural continuous time analog would 
be to integrate a memoryless process, but as with white noise, such memory- 
less processes are not well-defined. One can do formal derivations analogous 
to the discrete time case and sometimes (but not always) arrive at correct 
answers. We will use alternative and more rigorous tools when dealing with 
the continuous time processes. We do note, however, that while we cannot 
express a continuous time process with independent increments as the output
of a linear system driven by a continuous time memoryless process, for
any collection of sample times $t_0 = 0, t_1, \ldots, t_n$ we can write

$$ Y_{t_n} = \sum_{i=1}^{n} (Y_{t_i} - Y_{t_{i-1}}) \tag{6.10} $$

and that the increments in the parentheses are independent. That is, we
can write $Y_{t_n}$ as a sum of independent increments (in many ways, in fact),
and the increments are identically distributed if the time interval widths
are identical for all increments.

Since discrete time isi processes can always be expressed as the sum of 
iid random variables, their first and second moments always have the form 
of (6.7) and (6.8). In section 6.4 it is shown that (6.7) and (6.8) also hold
for continuous time processes with stationary and independent increments.

We again emphasize that an independent increment process may have 
stationary increments, but we already know from the moment calculations 
of (6.7) and (6.8) that the process itself cannot be weakly stationary. Since
the mean and covariance grow with time, independent increment processes 
clearly only make sense as one-sided processes. 

6.4 ⋆Second-Order Moments of ISI Processes

In this section we show that several important properties of the discrete 
time independent increment processes hold for the continuous time case. 
In the next section we generalize the specification techniques and give two 
examples of such processes - the continuous time Wiener process and the 
Poisson counting process. This section is devoted to the proof that (6.7) and 
(6.8) hold for continuous time processes with independent and stationary 
increments. The proof is primarily algebraic and can easily be skipped. 

We now consider a continuous time random process $\{Y_t; t \in \mathcal{T}\}$, where
$\mathcal{T} = [0, \infty)$, having independent stationary increments and initial condition
$Y_0 = 0$. The techniques used in this section can also be used for an
alternative derivation of the discrete time results.

First observe that given any time $t$ and any positive delay or lag $\tau > 0$,
we have that

$$ Y_{t+\tau} = (Y_{t+\tau} - Y_t) + Y_t , \tag{6.11} $$

and hence, by the linearity of expectation,

$$ EY_{t+\tau} = E[Y_{t+\tau} - Y_t] + EY_t . $$

Since the increments are stationary, however, the increment $Y_{t+\tau} - Y_t$ has
the same distribution, and hence the same expectation, as the increment
$Y_\tau - Y_0 = Y_\tau$, and hence

$$ EY_{t+\tau} = EY_\tau + EY_t . $$

This equation has the general form

$$ g(t + \tau) = g(\tau) + g(t) . \tag{6.12} $$

An equation of this form is called a linear functional equation and has 
a unique solution of the form g(t) = ct, where c is a constant that is 
determined by some boundary condition. Thus, in particular, the solution 
to (6.12) is 

$$ g(t) = g(1)t . \tag{6.13} $$

Thus we have that the mean of a continuous time independent increment 
process with stationary increments is given by 

$$ EY_t = tm , \quad t \in \mathcal{T} , \tag{6.14} $$

where the constant $m$ is determined by the boundary condition

$$ m = EY_1 . $$

Thus (6.7) extends to the continuous time case. 
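
For readers who want to see why the linear functional equation forces a linear solution, the following is a sketch of the usual argument for rational arguments, with $p$ and $q$ denoting positive integers (symbols introduced here for illustration only):

```latex
\begin{align*}
g(0) &= g(0) + g(0) \;\Rightarrow\; g(0) = 0, \\
g(nt) &= g\big((n-1)t\big) + g(t) = \cdots = n\,g(t), \qquad n = 1, 2, \ldots, \\
g(1) &= g\!\left(q \cdot \tfrac{1}{q}\right) = q\,g\!\left(\tfrac{1}{q}\right)
  \;\Rightarrow\; g\!\left(\tfrac{1}{q}\right) = \tfrac{1}{q}\,g(1), \\
g\!\left(\tfrac{p}{q}\right) &= p\,g\!\left(\tfrac{1}{q}\right) = \tfrac{p}{q}\,g(1).
\end{align*}
```

Passing from rational to arbitrary real $t \ge 0$ requires a mild regularity assumption on $g$, such as continuity or monotonicity, which is taken for granted in the development here.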

Since $Y_0 = 0$, we can rewrite (6.11) as

$$ Y_{t+\tau} = (Y_{t+\tau} - Y_t) + (Y_t - Y_0) , \tag{6.15} $$

that is, we can express $Y_{t+\tau}$ as the sum of two independent increments. The
variance of the sum of two independent random variables, however, is just
the sum of the two variances. In addition, the variance of the increment
$Y_{t+\tau} - Y_t$ is the same as the variance of $Y_\tau - Y_0 = Y_\tau$ since the increments
are stationary. Thus (6.15) implies that

$$ \sigma_{Y_{t+\tau}}^2 = \sigma_{Y_\tau}^2 + \sigma_{Y_t}^2 , $$

which is again a linear functional equation and hence has the solution

$$ \sigma_{Y_t}^2 = t\sigma^2 , \tag{6.16} $$

where the appropriate boundary condition is

$$ \sigma^2 = \sigma_{Y_1}^2 . $$

Knowing the variance immediately yields the second moment:

$$ E(Y_t^2) = \sigma_{Y_t}^2 + (EY_t)^2 = t\sigma^2 + (tm)^2 . \tag{6.17} $$

Consider next the autocorrelation function $R_Y(t, s)$. Choose $t > s$ and
write $Y_t$ as the sum of two increments as



$$ Y_t = (Y_t - Y_s) + Y_s , $$

and hence

$$ R_Y(t, s) = E[Y_t Y_s] = E[(Y_t - Y_s)Y_s] + E[Y_s^2] $$

using the linearity of expectation. The left term on the right is, however, the
expectation of the product of two independent random variables since the
increments $Y_t - Y_s$ and $Y_s - Y_0$ are independent. Thus from theorem 4.3 the
expectation of the product is the product of the expectations. Furthermore,
the expectation of the increment $Y_t - Y_s$ is the same as the expectation of
the increment $Y_{t-s} - Y_0 = Y_{t-s}$ since the increments are stationary. Thus
we have from this, (6.14), and (6.17) that

$$ R_Y(t, s) = (t - s)m \, sm + s\sigma^2 + (sm)^2 = s\sigma^2 + (tm)(sm) . $$

Repeating the development for the case $t < s$ then yields

$$ R_Y(t, s) = \sigma^2 \min(t, s) + (tm)(sm) , \tag{6.18} $$

which yields the covariance

$$ K_Y(t, s) = \sigma^2 \min(t, s) ; \quad t, s \in \mathcal{T} , \tag{6.19} $$

which extends (6.8) to the continuous time case. 



6.5 Specification of Continuous Time ISI Processes

The specification of processes with independent and stationary increments 
is almost the same in continuous time as it is in discrete time, the only real 
difference being that in continuous time we must consider more general col- 
lections of sample times. In discrete time the specification was constructed 
using the marginal probability function of the underlying iid process, which 
implies the pmf of the increments. In continuous time we have no under- 
lying iid process so we instead assume that we are given a formula for the 
cdf (pdf or pmf) of the increments; that is, for any t > s we have a cdf 

$$ F_{Y_t - Y_s}(y) = F_{Y_{|t-s|} - Y_0}(y) = F_{Y_{|t-s|}}(y) \tag{6.20} $$

or, equivalently, the corresponding pmf $p_{Y_t - Y_s}(y)$ for a discrete amplitude
process or pdf $f_{Y_t - Y_s}(y)$ for a continuous amplitude process.

To specify a continuous time process we need a formula for the joint
probability functions for all $n$ and all ordered sample times $t_1, t_2, \ldots, t_n$
(that is, $t_i < t_j$ if $i < j$). As in the discrete time case, we consider
conditional probability functions. To allow both discrete and continuous
alphabet, we first focus on conditional cdf’s and find the conditional cdf
$P(Y_{t_n} \le y_n | Y_{t_{n-1}} = y_{n-1}, Y_{t_{n-2}} = y_{n-2}, \ldots)$. Then, using (6.10) we can
apply the techniques used in discrete time by simply replacing the sample
times $i$ by $t_i$ for $i = 0, 1, \ldots, n$. That is, we define the random variables
$\{X_n\}$ by

$$ X_n = Y_{t_n} - Y_{t_{n-1}} . \tag{6.21} $$

Then the $\{X_n\}$ are independent (but not identically distributed unless the
times between adjacent samples are all equal), and

$$ Y_{t_n} = \sum_{i=1}^{n} X_i \tag{6.22} $$

and

$$ P(Y_{t_n} \le y_n | Y_{t_{n-1}} = y_{n-1}, Y_{t_{n-2}} = y_{n-2}, \ldots) = F_{X_n}(y_n - y_{n-1}) = F_{Y_{t_n} - Y_{t_{n-1}}}(y_n - y_{n-1}) . \tag{6.23} $$

This conditional cdf can then be used to evaluate the conditional pmf or 
pdf as 



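$$ p_{Y_{t_n} | Y_{t_{n-1}}, \ldots, Y_{t_1}}(y_n | y_{n-1}, \ldots, y_1) = p_{X_n}(y_n - y_{n-1}) = p_{Y_{t_n} - Y_{t_{n-1}}}(y_n - y_{n-1}) \tag{6.24} $$

or

$$ f_{Y_{t_n} | Y_{t_{n-1}}, \ldots, Y_{t_1}}(y_n | y_{n-1}, \ldots, y_1) = f_{X_n}(y_n - y_{n-1}) = f_{Y_{t_n} - Y_{t_{n-1}}}(y_n - y_{n-1}) , \tag{6.25} $$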

respectively. These can then be used to find the joint pdf’s or pmf’s as
before as

$$ f_{Y_{t_1}, \ldots, Y_{t_n}}(y_1, \ldots, y_n) = \prod_{i=1}^{n} f_{Y_{t_i} - Y_{t_{i-1}}}(y_i - y_{i-1}) , $$

$$ p_{Y_{t_1}, \ldots, Y_{t_n}}(y_1, \ldots, y_n) = \prod_{i=1}^{n} p_{Y_{t_i} - Y_{t_{i-1}}}(y_i - y_{i-1}) , $$

respectively, where $t_0 = 0$ and $y_0 = 0$. Since we can thus write the joint
probability functions for any
finite collection of sample times in terms of the given probability function 
for the increments, the process is completely specified. 

The most important point of these relations is that if we are told that 
a process has independent and stationary increments and we are given a 
cdf or pmf or pdf for Yt = Yt — Yq, then the process is completely defined 
via the specification just given! Knowing the probabilistic description of 
the jumps and that the jumps are independent and stationary completely 
describes the process. 
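
The product form of the joint pdf can be checked numerically in a simple case. The sketch below assumes zero-mean Gaussian increments whose variance is proportional to the interval length (the increment law of the Wiener process) and compares the product of increment pdfs at two sample times with a bivariate Gaussian pdf whose covariance is the min form derived earlier. All function names and numerical values are illustrative.

```python
import math


def gaussian_pdf(y, var):
    """Zero-mean Gaussian pdf with variance var."""
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def joint_pdf_from_increments(times, values, sigma2=1.0):
    """Joint pdf of (Y_{t_1}, ..., Y_{t_n}) for ordered times 0 < t_1 < ... < t_n,
    built as a product of increment pdfs with t_0 = 0 and Y_0 = 0."""
    density = 1.0
    t_prev, y_prev = 0.0, 0.0
    for t, y in zip(times, values):
        density *= gaussian_pdf(y - y_prev, sigma2 * (t - t_prev))
        t_prev, y_prev = t, y
    return density


def bivariate_gaussian_pdf(y1, y2, k11, k12, k22):
    """Zero-mean bivariate Gaussian pdf with covariance [[k11, k12], [k12, k22]]."""
    det = k11 * k22 - k12 * k12
    quad = (k22 * y1 * y1 - 2.0 * k12 * y1 * y2 + k11 * y2 * y2) / det
    return math.exp(-quad / 2.0) / (2.0 * math.pi * math.sqrt(det))


t1, t2, sigma2 = 0.5, 2.0, 1.5
y1, y2 = 0.3, -0.4
via_increments = joint_pdf_from_increments([t1, t2], [y1, y2], sigma2)
via_covariance = bivariate_gaussian_pdf(
    y1, y2, sigma2 * t1, sigma2 * min(t1, t2), sigma2 * t2)
print(via_increments, via_covariance)
```

The two numbers agree up to floating-point rounding, as one would expect if the increment description and the second-order description specify the same Gaussian law.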

As in discrete time, a continuous time random process {Yt} is called 
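a Markov process if and only if for all $n$ and all ordered sample times
$t_1 < t_2 < \ldots < t_n$ we have for all $y_n, y_{n-1}, \ldots$ that

$$ P(Y_{t_n} \le y_n | Y_{t_{n-1}} = y_{n-1}, Y_{t_{n-2}} = y_{n-2}, \ldots) = P(Y_{t_n} \le y_n | Y_{t_{n-1}} = y_{n-1}) \tag{6.26} $$

or equivalently,

$$ f_{Y_{t_n} | Y_{t_{n-1}}, \ldots, Y_{t_1}}(y_n | y_{n-1}, \ldots, y_1) = f_{Y_{t_n} | Y_{t_{n-1}}}(y_n | y_{n-1}) $$

for continuous alphabet processes and

$$ p_{Y_{t_n} | Y_{t_{n-1}}, \ldots, Y_{t_1}}(y_n | y_{n-1}, \ldots, y_1) = p_{Y_{t_n} | Y_{t_{n-1}}}(y_n | y_{n-1}) $$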

for discrete alphabet processes. Analogous to the discrete time case, con- 
tinuous time independent increment processes are Markov processes. 

We close this section with the two most famous examples of continuous 
time independent increment processes. 

[6.1] The Continuous Time Wiener process

The Wiener process is a continuous time independent increment pro- 
cess with stationary increments such that the increment densities are 
Gaussian with zero mean; that is, for t > 0, 



$$ f_{Y_t}(y) = \frac{e^{-y^2/(2t\sigma^2)}}{\sqrt{2\pi t\sigma^2}} ; \quad y \in \Re . $$



The form of the variance follows necessarily from the previously derived
form for all independent increment processes with stationary increments.
The specification for this process and the Gaussian form of the increment
pdf’s imply that the Wiener process is a Gaussian process.

[6.2] The Poisson counting process is a continuous time discrete alphabet
independent increment process with stationary increments such that
the increments have a Poisson distribution; that is, for $t > 0$,

$$ p_{Y_t}(k) = \Pr(Y_t = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} ; \quad k = 0, 1, 2, \ldots , $$

where $\lambda > 0$ is a fixed rate parameter.
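
A Poisson increment law is consistent with independent and stationary increments in the following sense: the count over a long interval must have the same pmf as the sum of independent counts over two adjacent subintervals. The sketch below checks this numerically, assuming the standard Poisson pmf with mean proportional to the interval length; the rate `lam` and the interval lengths are illustrative values.

```python
import math


def poisson_pmf(k, lam, t):
    """Pr(count over an interval of length t equals k) for rate lam."""
    return (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)


lam, s, t = 2.0, 0.7, 1.8
k = 4

# pmf of the count over [0, t] evaluated directly.
direct = poisson_pmf(k, lam, t)

# The same probability via independent counts over [0, s] and (s, t]:
# i events in the first interval and k - i in the second.
by_increments = sum(poisson_pmf(i, lam, s) * poisson_pmf(k - i, lam, t - s)
                    for i in range(k + 1))

total_mass = sum(poisson_pmf(i, lam, t) for i in range(60))
print(direct, by_increments, total_mass)
```

The two probabilities agree up to rounding, and the pmf sums to one to within floating-point accuracy.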

6.6 Moving-Average and Autoregressive Processes

We have seen in the preceding sections that for discrete time random pro- 
cesses the moving-average representation can be used to yield the second- 
order moments and also can be used to find the marginal probability func- 
tion of independent increment processes. The general specification for inde- 
pendent increment processes, however, was found using the autoregressive 
representation. In this section we consider results for more general processes 
using virtually the same methods. 

First assume that we have a moving-average process representation de- 
scribed by (6.1). We can use characteristic function techniques to find a 
simple form for the marginal characteristic function of the output process. 
In particular, assuming convergence conditions are satisfied where needed 
and observing that $Y_n$ is a weighted sum of independent random variables,
the characteristic function of the output random process marginal distribution
is calculated as the product of the transforms

$$ M_{Y_n}(ju) = \prod_k M_{h_k X_{n-k}}(ju) . $$

The individual transforms are easily shown to be

$$ M_{h_k X_{n-k}}(ju) = E\left[ e^{juh_k X_{n-k}} \right] = M_{X_{n-k}}(juh_k) = M_X(juh_k) . $$

Thus

$$ M_{Y_n}(ju) = \prod_k M_X(juh_k) , \tag{6.27} $$

where the product is, as usual, dependent on the index sets on which $\{X_n\}$
and $\{h_k\}$ are defined.
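
The product formula for the output characteristic function can be verified on a small example. The sketch below assumes a Bernoulli iid input and a three-tap filter (both chosen only for illustration) and compares the product of input characteristic functions with the expectation computed by brute-force enumeration of all input patterns.

```python
import cmath
from itertools import product


def char_fn_bernoulli(u, p):
    """M_X(ju) = E[exp(juX)] for X Bernoulli with Pr(X = 1) = p."""
    return (1.0 - p) + p * cmath.exp(1j * u)


def char_fn_output_product(u, h, p):
    """Product over k of M_X(j u h_k), the right-hand side of the formula."""
    result = 1.0 + 0.0j
    for hk in h:
        result *= char_fn_bernoulli(u * hk, p)
    return result


def char_fn_output_direct(u, h, p):
    """E[exp(juY)] for Y = sum_k h_k X_{n-k}, by enumerating all input patterns."""
    total = 0.0 + 0.0j
    for bits in product([0, 1], repeat=len(h)):
        prob = 1.0
        for b in bits:
            prob *= p if b else (1.0 - p)
        y = sum(hk * b for hk, b in zip(h, bits))
        total += prob * cmath.exp(1j * u * y)
    return total


h, p, u = [1.0, 0.5, 0.25], 0.3, 1.7
lhs = char_fn_output_direct(u, h, p)
rhs = char_fn_output_product(u, h, p)
print(lhs, rhs)
```

The enumeration grows exponentially with the filter length, whereas the product grows linearly, which hints at why the transform route is the practical one for marginal distributions.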

Equation (6.27) can be inverted in some cases to yield the output cdf 
and pdf or pmf. Unfortunately, however, in general this is about as far as 
one can go in this direction, even for an iid input process. Attempts to find
joint or conditional distributions of the output process by this
or other techniques will generally be frustrated by the complexity of the
calculations required.

Part of the difficulty in finding conditional distributions lies in the 
moving-average representation. The techniques used successfully for the 
independent increment processes relied on an autoregressive representation 
of the output process. We will now show that the methods used work 
for more general autoregressive process representations. We will consider 
specifically causal autoregressive processes represented as in (6.3) so that

$$ Y_n = X_n - \sum_{k>0} a_k Y_{n-k} . $$

By the independence and causality conditions, the $\{Y_{n-k}\}$ in the sum are
independent of $X_n$. Hence we have a representation for $Y_n$ as the sum of two
independent random variables, $X_n$ and the weighted sum of the $Y$’s. The
latter quantity is treated as if it were a constant in calculating conditional
probabilities for $Y_n$. Thus the conditional probability of an event for $Y_n$ can
be specified in terms of the marginal probability of an easily determined
event for $X_n$. Specifically, the conditional cdf for $Y_n$ is

$$ \Pr[Y_n \le y_n | y_{n-1}, y_{n-2}, \ldots] = \Pr\left[ X_n \le y_n + \sum_{k>0} a_k y_{n-k} \right] = F_X\left( \sum_{k \ge 0} a_k y_{n-k} \right) , \tag{6.28} $$

where $a_0 = 1$. The conditional pmf or pdf can now be found. For example,
if the input random process is continuous alphabet, the conditional output
pdf is found by differentiation to be

$$ f_{Y_n | Y_{n-1}, Y_{n-2}, \ldots}(y_n | y_{n-1}, y_{n-2}, \ldots) = f_X\left( \sum_{k \ge 0} a_k y_{n-k} \right) . \tag{6.29} $$
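
The conditional pdf amounts to evaluating the input pdf at a weighted sum of the current and past output values. A minimal sketch follows, assuming a finite-order filter and a Laplacian input pdf chosen purely for illustration; the function names are hypothetical.

```python
import math


def conditional_pdf(y_now, y_past, a, f_x):
    """Conditional pdf of Y_n at y_now given past outputs.

    y_past[0] is y_{n-1}, y_past[1] is y_{n-2}, and so on; a[0] must equal 1.
    The input pdf f_x is evaluated at sum_{k >= 0} a[k] * y_{n-k}.
    """
    arg = a[0] * y_now + sum(ak * yp for ak, yp in zip(a[1:], y_past))
    return f_x(arg)


def laplace_pdf(x):
    """Unit-scale Laplacian pdf, used here only as an example input density."""
    return 0.5 * math.exp(-abs(x))


# First-order filter with a_1 = -0.5, so that Y_n = X_n + 0.5 * Y_{n-1}.
value = conditional_pdf(1.0, [2.0], [1.0, -0.5], laplace_pdf)
print(value)
```

Here the argument of the input pdf is 1.0 - 0.5 * 2.0 = 0, so the conditional density takes the peak value of the input pdf.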

Finally, the complete specification can be obtained by a product of pmf’s or 
pdf’s by the chain rule as in (3.144) or (3.154). The discrete time indepen- 
dent increment result is obviously a special case of this equation. For more 
general processes, we need only require that the sum converge in (6.29) and 
that the corresponding conditional pdf’s be appropriately defined (using the 
general conditional probability approach). We next consider an important
example of the ideas of this section.



6.7 The Discrete Time Gauss-Markov Process



As an example of the development of the preceding section, consider the
filter given in example [5.1]. Let $\{X_n\}$ be an iid Gaussian process with
mean $m$ and variance $\sigma_X^2$. The moving-average representation is

$$ Y_n = \sum_{k=0}^{\infty} r^k X_{n-k} , \tag{6.30} $$

from which (6.27) can be applied to find that

$$ M_{Y_n}(ju) = \prod_k e^{j(ur^k)m - \frac{1}{2}(ur^k)^2\sigma_X^2} = e^{jum_Y - \frac{1}{2}u^2\sigma_Y^2} ; $$

that is, $Y_n$ is a Gaussian random variable with mean $m_Y = m\sum_k r^k = m/(1 - r)$
and variance $\sigma_Y^2 = \sigma_X^2 \sum_k r^{2k} = \sigma_X^2/(1 - r^2)$, the moments found by the
second-order theory in example [5.1].
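
The closed forms for the output mean and variance are geometric series, which can be checked by summing a finite number of terms. The sketch below assumes $|r| < 1$ so that the series converge; the function name, the truncation length, and the numerical values are illustrative.

```python
import math


def gauss_markov_moments(m, var_x, r, terms=200):
    """Approximate the output mean and variance by truncating the series
    m * sum_k r^k and var_x * sum_k r^(2k) after the given number of terms."""
    mean_y = m * sum(r ** k for k in range(terms))
    var_y = var_x * sum(r ** (2 * k) for k in range(terms))
    return mean_y, var_y


m, var_x, r = 1.0, 2.0, 0.5
mean_y, var_y = gauss_markov_moments(m, var_x, r)
print(mean_y, m / (1.0 - r))
print(var_y, var_x / (1.0 - r * r))
```

For these values the truncated sums are indistinguishable from the closed forms at double precision, since the neglected terms are of order $r^{200}$.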

To find a complete specification for this process, we now turn to an 
autoregressive model. From (6.30) it follows that $Y_n$ must satisfy the
difference equation

$$ Y_n = X_n + rY_{n-1} . \tag{6.31} $$

Hence $\{Y_n\}$ is a first-order autoregressive source with $a_0 = 1$ and $a_1 = -r$.
Note that as with the Wiener process, this process can be represented as 
a first-order autoregressive process or as an infinite-order, moving average 
process. In fact, the Wiener process is the one-sided version of this process 
with r = 1. 

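Application of (6.29) yields

$$ f_{Y^n}(y^n) = f_{Y_n}(y_n | y_{n-1}, y_{n-2}, \ldots) f_{Y_{n-1}}(y_{n-1} | y_{n-2}, y_{n-3}, \ldots) \cdots f_{Y_1}(y_1) $$

$$ = f_{Y_1}(y_1) \prod_{i=2}^{n} f_X(y_i - ry_{i-1}) $$

$$ = \frac{e^{-(y_1 - m_Y)^2/(2\sigma_Y^2)}}{\sqrt{2\pi\sigma_Y^2}} \left( \frac{1}{\sqrt{2\pi\sigma_X^2}} \right)^{n-1} e^{-\frac{1}{2\sigma_X^2} \sum_{i=2}^{n} (y_i - ry_{i-1} - m)^2} . \tag{6.32} $$



6.8 Gaussian Random Processes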

We have seen how to calculate the mean, covariance function, or spectral 
density of the output process of a linear filter driven by an input random 
process whose mean, covariance function, or spectral density is known. In 
general, however, it is not possible to derive a complete specification of the 
output process. We have seen one exception: The output random process 
of an autoregressive filter driven by an iid input random process can be 
specified through the conditional pmf’s or pdf’s, as in equation (6.29). In 
this section we develop another important exception by showing that the 
output process of a linear filter driven by a Gaussian random process — not 
necessarily iid — is also Gaussian. Thus simply knowing the output mean 
and autocorrelation or covariance functions — the only parameters of a 
Gaussian distribution — provides a complete specification. The underlying 
idea is that of theorem 4.4: a linear operation on a Gaussian vector yields 
another Gaussian random vector. The output vector mean and matrix 
covariance of the theorem are in fact just the vector and matrix versions of 
the linear system second-moment I/O relations (5.3) and (5.7). The output
of a discrete time FIR linear filter can be expressed as a linear operation on 
the input as in (4.26), that is, a finite dimensional matrix times an input 
vector plus a constant. Therefore we can immediately extend theorem 4.4 
to FIR filtering and argue that all finite dimensional distributions of the 
output process are Gaussian and hence the process itself must be Gaussian. 
It is also possible to extend theorem 4.4 to include more general impulse 
responses and to continuous time by using appropriate limiting arguments. 
We will not prove such extensions. Instead we will merely state the result 
as a corollary: 

Corollary 6.1 If a Gaussian random process {Xt} is passed through a 
linear filter, then the output is also a Gaussian random process with mean 
and covariance given by (5.3) and (5.7). 
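
For an FIR filter acting on a finite block of inputs, the output mean and covariance follow from the matrix form of the filter: if the output vector is a matrix times the input vector, its mean is the matrix times the input mean and its covariance is the matrix times the input covariance times the matrix transpose. The sketch below works this out with plain lists; the helper names and the two-tap example are illustrative, and only outputs whose full filter window lies inside the input block are kept.

```python
def fir_matrix(h, n_in):
    """Matrix H with Y = H X for Y_n = sum_k h[k] X_{n-k}, n = len(h)-1, ..., n_in-1."""
    rows = []
    for n in range(len(h) - 1, n_in):
        rows.append([h[n - i] if 0 <= n - i < len(h) else 0.0
                     for i in range(n_in)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def output_moments(h, mean_x, cov_x):
    """Mean vector and covariance matrix of the filtered block."""
    H = fir_matrix(h, len(mean_x))
    mean_y = [sum(hij * mj for hij, mj in zip(row, mean_x)) for row in H]
    cov_y = matmul(matmul(H, cov_x), transpose(H))
    return mean_y, cov_y


n_in = 4
identity = [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_in)]

# Y_n = X_n + X_{n-1} driven by uncorrelated unit-variance inputs with mean 0.5.
mean_y, cov_y = output_moments([1.0, 1.0], [0.5] * n_in, identity)
print(mean_y)
for row in cov_y:
    print(row)
```

If the input block is Gaussian, these two quantities are all that is needed to write down the joint distribution of the output block.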



6.9 ⋆The Poisson Counting Process

An engineer encounters two types of random processes in practice. The 
first is the random process whose probability distribution depends largely 
on design parameters: the type of modulation used, the method of data 
coding used, etc. The second type is the random process whose probability
distribution depends on naturally occurring phenomena over which the
engineer has little if any control: noise in physical devices, speech waveforms,
the number of messages in a telephone system as a function of time,
etc. Central limit theorems provide one example of such processes. This
section is devoted to another example: the Poisson process. Here the
basic Poisson counting process is derived from physical assumptions and a 
variety of properties are developed. Gaussian and Poisson processes pro- 
vide classes of random processes that characterize (at least approximately) 
the majority of naturally occurring random processes. The development 
of Poisson processes provides further examples of many of the techniques 
developed so far. 

Our intent here is to remove some of the mystery of the functional forms 
of two important distributions by showing how these apparently compli- 
cated distributional forms arise from nature. Therefore, the development 
presented is somewhat brief, without consideration of all the mathematical 
details. 

The Poisson counting process was introduced as an example of specifi- 
cation of an independent and stationary increment process. In this section 
the same process is derived from a more physical argument. 

Consider modeling a continuous time counting process $\{N_t;\ t \ge 0\}$ with
the following properties:

1. $N_0 = 0$ (the initial condition).

2. The process has independent and stationary increments. Hence the 
changes, called jumps, during nonoverlapping time intervals are in- 
dependent random variables. The jumps in a given time interval are 
memoryless, and their amplitude does not depend on what happened 
before that interval. 

3. In the limit of very small time intervals, the probability of an incre- 
ment of 1, that is, of increasing the total count by 1, is proportional 
to the length of the time interval. The probability of an increment 
greater than 1 is negligible in comparison, e.g., is proportional to 
powers greater than 1 of the length of the time interval. 

These properties well describe many physical phenomena such as the 
emission of electrons and other subatomic particles from irradiated objects 
(remember vacuum tubes?), the arrival of customers at a store or phone 
calls at an exchange, and other phenomena where events such as arrivals 
or discharges occur randomly in time. The properties naturally capture 
the intuition that such events do not depend on the past and that for a 
very tiny interval, the probability of such an event is proportional to the 
length of the interval. For example, if you are waiting for a phone call, the
probability of its happening during a period of $\tau$ seconds is proportional to
$\tau$. The probability of more than one phone call in a very small period $\tau$
is, however, negligible in comparison.

The third property can be quantified as follows: Let $\lambda$ be the proportionality
constant. Then for a small enough time interval $\Delta t$,

$$
\begin{aligned}
\Pr(N_{t+\Delta t} - N_t = 1) &\cong \lambda \Delta t \\
\Pr(N_{t+\Delta t} - N_t = 0) &\cong 1 - \lambda \Delta t \\
\Pr(N_{t+\Delta t} - N_t > 1) &\cong 0 .
\end{aligned}
\tag{6.33}
$$

The relations of (6.33) can be stated rigorously by limit statements, but 
we shall use them in the more intuitive form given. 

We now use properties 1 to 3 to derive the probability mass function
$p_{N_t - N_0}(k) = p_{N_t}(k)$ for an increment $N_t - N_0$, from the starting time at
time 0 up to time $t > 0$ with $N_0 = 0$. For convenience we temporarily
change notation and define

$$
p(k, t) = p_{N_t - N_0}(k) ; \quad t > 0 .
$$

Let $\Delta t$ be a differentially small interval as in (6.33), and we have that

$$
p(k, t + \Delta t) = \sum_{n=0}^{k} \Pr(N_t = n) \Pr(N_{t+\Delta t} - N_t = k - n \mid N_t = n) .
$$

Since the increments are independent, the conditioning can be dropped so
that, using (6.33),

$$
\begin{aligned}
p(k, t + \Delta t) &= \sum_{n=0}^{k} \Pr(N_t = n) \Pr(N_{t+\Delta t} - N_t = k - n) \\
&= p(k, t)(1 - \lambda \Delta t) + p(k - 1, t)\lambda \Delta t ,
\end{aligned}
$$

which with some algebra yields

$$
\frac{p(k, t + \Delta t) - p(k, t)}{\Delta t} = p(k - 1, t)\lambda - p(k, t)\lambda .
$$

In the limit as $\Delta t \to 0$ this becomes the differential equation

$$
\frac{d}{dt} p(k, t) + \lambda p(k, t) = \lambda p(k - 1, t) , \quad t > 0 .
$$

The initial condition for this differential equation follows from the initial
condition for the process, $N_0 = 0$; i.e.,

$$
p(k, 0) = \begin{cases} 0 & k \ne 0 \\ 1 & k = 0 \end{cases}
$$

since this corresponds to $\Pr(N_0 = 0) = 1$. The solution to the differential
equation with the given initial condition is

$$
p_{N_t}(k) = p(k, t) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} ; \quad k = 0, 1, 2, \ldots ; \quad t \ge 0 . \tag{6.34}
$$

(This is easily verified by direct substitution.) 
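As a numerical companion to the substitution check, the following minimal sketch integrates the differential equation with a forward-Euler step and compares the result with the closed form of the Poisson pmf. The rate, time horizon, step count, and function names are illustrative assumptions, not values taken from the text.

```python
import math

def poisson_pmf(k, lam, t):
    """Closed-form Poisson pmf (lam*t)^k e^{-lam*t} / k!."""
    return (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)

def integrate_counting_ode(lam, t_end, k_max, steps=20000):
    """Forward-Euler solution of dp(k,t)/dt = lam*p(k-1,t) - lam*p(k,t)."""
    dt = t_end / steps
    p = [1.0] + [0.0] * k_max            # initial condition: p(0,0)=1, p(k,0)=0 otherwise
    for _ in range(steps):
        prev = [0.0] + p[:-1]            # p(k-1, t), with p(-1, t) taken to be 0
        p = [pk + dt * lam * (pm - pk) for pk, pm in zip(p, prev)]
    return p

lam, t_end, k_max = 2.0, 1.5, 10         # assumed example values
numeric = integrate_counting_ode(lam, t_end, k_max)
exact = [poisson_pmf(k, lam, t_end) for k in range(k_max + 1)]
max_err = max(abs(a - b) for a, b in zip(numeric, exact))
print(f"largest |Euler - Poisson| over k = 0..{k_max}: {max_err:.2e}")
```

Because the equation for each k involves only p(k, t) and p(k - 1, t), truncating the system at a finite k_max does not disturb the values that are kept.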

The pmf of (6.34) is the Poisson pmf, and hence the given properties
produce the Poisson counting process. Note that (6.34) can be generalized
using the stationarity of the increments to yield the pmf for $k$ jumps in an
arbitrary interval $(s, t]$ as

$$
p_{N_t - N_s}(k) = \frac{(\lambda (t - s))^k e^{-\lambda (t - s)}}{k!} ; \quad k = 0, 1, \ldots ; \quad t \ge s . \tag{6.35}
$$

As developed in chapter 6, these pmf’s and the given properties yield a 
complete specification of the Poisson counting process. 

Note that sums of Poisson random variables are Poisson. This follows 
from the development of this section. That is, for any $t > s > r$, all three
of the indicated quantities in $(N_t - N_s) + (N_s - N_r) = N_t - N_r$ are Poisson.
Thus the Poisson distribution is infinitely divisible in the sense defined 
at the end of the preceding section. Of course the infinite divisibility of 
Poisson random variables can also be verified by characteristic functions 
as in (4.101). Poisson random variables satisfy the requirements of the 
central limit theorem and hence it can be concluded that with appropriate 
normalization, the Poisson cdf approaches the Gaussian cdf asymptotically. 
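The claim that sums of independent Poisson random variables are Poisson can also be checked numerically. The sketch below convolves the pmfs of two increments over adjacent intervals and compares the result with the Poisson pmf for the combined interval; the rate and the times r, s, t are assumed example values.

```python
import math

def poisson_pmf(k, mean):
    """Poisson pmf with the given mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def convolve_poisson(mean_a, mean_b, k_max):
    """pmf of the sum of two independent Poisson variables, by discrete convolution."""
    return [
        sum(poisson_pmf(n, mean_a) * poisson_pmf(k - n, mean_b) for n in range(k + 1))
        for k in range(k_max + 1)
    ]

lam, r, s, t = 1.2, 0.5, 1.3, 2.4        # assumed example values with t > s > r
k_max = 12
summed = convolve_poisson(lam * (s - r), lam * (t - s), k_max)
direct = [poisson_pmf(k, lam * (t - r)) for k in range(k_max + 1)]
gap = max(abs(a - b) for a, b in zip(summed, direct))
print(f"largest difference between convolution and direct pmf: {gap:.2e}")
```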

6.10 Compound Processes 

So far the various processes with memory have been constructed by passing 
iid processes through linear filters. In this section a more complicated 
construction of a new process is presented which is not a simple linear 
operation. A compound process is a random process constructed from two 
other random processes rather than from a single input process. It is formed 
by summing consecutive outputs of an iid discrete time random process, 
but the number of terms included in the sum is determined by a counting 
random process, which can be discrete or continuous time. As an example 
where such processes arise, suppose that on a particular telephone line the 
number of calls arriving in $t$ minutes is a random variable $N_t$. The resulting
$N_t$ calls have durations $X_1, X_2, \ldots, X_{N_t}$. What is the total amount of time
occupied by the calls? It is the random variable

$$
Y_t = \sum_{k=1}^{N_t} X_k .
$$

Since this random variable is defined for all positive $t$, $\{Y_t\}$ is a random
process, depending on two separate processes: a counting process $\{N_t\}$
and an iid process $\{X_n\}$. In this section we explore the properties of such
processes. The main tool used to investigate such processes is conditional 
expectation. 

Suppose that $\{N_t;\ t \ge 0\}$ is a discrete or continuous time counting
process. Thus $t$ is assumed to take on either nonnegative real values or
nonnegative integer values. Suppose that $\{X_k\}$ is an iid process and that
the $X_k$ are independent of the $N_t$. Define the compound process
$\{Y_t;\ t \ge 0\}$ by

$$
Y_t = \begin{cases} 0 & t = 0 \\ \sum_{k=1}^{N_t} X_k & t > 0 . \end{cases} \tag{6.36}
$$

(An empty sum, which occurs when $N_t = 0$, is taken to be 0.)



What can be said about the process Yt? From iterated expectation we have 
that the mean of the compound process is given by 



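$$
\begin{aligned}
EY_t &= E[E(Y_t \mid N_t)] \\
&= E\Big[E\Big(\sum_{k=1}^{N_t} X_k \,\Big|\, N_t\Big)\Big] \\
&= E[N_t E(X)] \\
&= E(N_t)E(X) .
\end{aligned}
\tag{6.37}
$$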



Thus, for example, if $N_t$ is a binomial counting process with parameter $p$,
and $\{X_n\}$ is a Bernoulli process with parameter $\epsilon$, then $E(Y_k \mid N_k) = N_k \epsilon$
and hence $EY_k = \epsilon E(N_k) = \epsilon k p$. If $N_t$ is a Poisson counting process with
parameter $\lambda$, then $E(Y_t) = E(N_t)E(X) = \lambda t E(X)$.
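The mean formula for a compound Poisson process can be illustrated by simulation. In the sketch below the counting process is generated from exponential interarrival times, and the summands are taken to be uniform on (0, 3); both the summand distribution and the parameter values are assumptions chosen only for illustration.

```python
import random

def poisson_count(rate, duration, rng):
    """Number of arrivals in (0, duration], built from exponential interarrival times."""
    count, clock = 0, rng.expovariate(rate)
    while clock <= duration:
        count += 1
        clock += rng.expovariate(rate)
    return count

def sample_compound(rate, duration, rng):
    """One draw of Y_t = X_1 + ... + X_{N_t}, with X_k uniform on (0, 3) (assumed)."""
    return sum(rng.uniform(0.0, 3.0) for _ in range(poisson_count(rate, duration, rng)))

rng = random.Random(0)
rate, duration, trials = 2.0, 5.0, 20000  # assumed example values
estimate = sum(sample_compound(rate, duration, rng) for _ in range(trials)) / trials
predicted = rate * duration * 1.5         # lambda * t * E(X), with E(X) = 1.5
print(f"sample mean {estimate:.3f}, lambda*t*E(X) = {predicted:.3f}")
```

The sample mean should agree with the prediction up to Monte Carlo error, which shrinks as the number of trials grows.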

Other moments follow in a similar fashion. For example, the character- 
istic function of Yt can be evaluated using iterated expectation as 

$$
\begin{aligned}
M_{Y_t}(ju) &= E[e^{juY_t}] \\
&= E[E(e^{juY_t} \mid N_t)] \\
&= E[M_X(ju)^{N_t}] ,
\end{aligned}
\tag{6.38}
$$



where we have used the fact that conditioned on $N_t$, $Y_t$ is the sum of $N_t$
iid random variables with characteristic function $M_X$. To further evaluate
this, we again need to assume a distribution for $N_t$. Suppose first that $N_t$
is a binomial counting process. Then

$$
\begin{aligned}
E[M_X(ju)^{N_k}] &= \sum_{n=0}^{k} p_{N_k}(n) M_X(ju)^n \\
&= \sum_{n=0}^{k} \binom{k}{n} p^n (1 - p)^{k-n} M_X(ju)^n \\
&= (p M_X(ju) + (1 - p))^k .
\end{aligned}
\tag{6.39}
$$



Suppose instead that Nt is a continuous time Poisson counting process. 
Then 



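$$
\begin{aligned}
E[M_X(ju)^{N_t}] &= \sum_{n=0}^{\infty} p_{N_t}(n) M_X(ju)^n \\
&= \sum_{n=0}^{\infty} \frac{(\lambda t)^n e^{-\lambda t}}{n!} M_X(ju)^n \\
&= e^{-\lambda t} \sum_{n=0}^{\infty} \frac{(\lambda t M_X(ju))^n}{n!} \\
&= e^{-\lambda t (1 - M_X(ju))} ,
\end{aligned}
\tag{6.40}
$$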



where we have invoked the Taylor series expansion for an exponential. 

Both of these computations involve very complicated processes, yet they 
result in closed form solutions of modest complication. Since the charac- 
teristic functions are known, the marginal distributions of such processes 
follow. Further properties of compound processes are explored in the prob- 
lems. 
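The closed form obtained for the Poisson case can be sanity-checked against a truncated version of the defining series. The sketch below assumes Bernoulli summands with parameter eps, so that M_X(ju) = 1 - eps + eps e^{ju}; the parameter values and function names are illustrative assumptions.

```python
import cmath
import math

def bernoulli_cf(u, eps):
    """Characteristic function of a Bernoulli(eps) variable: 1 - eps + eps*e^{ju}."""
    return 1.0 - eps + eps * cmath.exp(1j * u)

def compound_cf_series(u, lam, t, eps, terms=80):
    """Truncated sum over n of p_{N_t}(n) * M_X(ju)^n for a Poisson count N_t."""
    m = bernoulli_cf(u, eps)
    return sum(
        math.exp(-lam * t) * (lam * t) ** n / math.factorial(n) * m ** n
        for n in range(terms)
    )

def compound_cf_closed(u, lam, t, eps):
    """Closed form exp(-lam*t*(1 - M_X(ju)))."""
    return cmath.exp(-lam * t * (1.0 - bernoulli_cf(u, eps)))

lam, t, eps = 1.5, 2.0, 0.3               # assumed example values
worst = max(
    abs(compound_cf_series(u, lam, t, eps) - compound_cf_closed(u, lam, t, eps))
    for u in (0.0, 0.5, 1.0, 2.0, 3.0)
)
print(f"largest |series - closed form| over the test frequencies: {worst:.2e}")
```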



6.11 Exponential Modulation

Lest the reader erroneously assume that all random process derived dis- 
tribution techniques apply only to linear operations on processes, we next 
consider an example of a class of processes generated by a nonlinear opera- 
tion on another process. While linear techniques rarely work for nonlinear 
systems, the systems that we shall consider form an important exception 
where one can find second-order moments and sometimes even complete 
specifications. The primary examples of processes generated in this way 
are phase-modulated (PM) and frequency-modulated (FM) Gaussian random processes
and the Poisson random telegraph wave. 

Let {X(t)} be a random process and define a new random process 

$$
Y(t) = a_0 e^{j(a_1 t + a_2 X(t) + \Theta)} , \tag{6.41}
$$

where $a_0$, $a_1$, and $a_2$ are fixed real constants and where $\Theta$ is a uniformly
distributed random phase angle on $[0, 2\pi]$. The process $\{Y(t)\}$ is called an
exponential modulation of $\{X(t)\}$. Observe that it is a nonlinear function
of the input process. Note further that, as defined, the process is a complex- 
valued random process, and hence we must modify some of our techniques. 
In some, but not all, of the interesting examples of exponentially modulated 
random processes we will wish to focus on the real part of the modulated 
process, which we will call 

$$
U(t) = \mathrm{Re}(Y(t)) = \tfrac{1}{2} Y(t) + \tfrac{1}{2} Y(t)^* = a_0 \cos(a_1 t + a_2 X(t) + \Theta) . \tag{6.42}
$$

In this form, exponential modulation is called phase modulation (PM) of a 
carrier of angular frequency $a_1$ by the input process $\{X(t)\}$. If the input
process is itself formed by integrating another random process, say $\{W(t)\}$,
then the U process is called the frequency modulation (FM) of the carrier by 
the W process. Phase and frequency modulation are extremely important 
examples of complex exponential modulation. 

A classic example of a random process arising in communications that 
can be put in the same form is obtained by setting $\Theta = 0$ (with probability
1), choosing $a_1 = 0$, $a_2 = \pi$, and letting the input process be the Poisson
counting process $\{N(t)\}$, that is, by considering the random process

$$
V(t) = a_0 (-1)^{N(t)} . \tag{6.43}
$$

This is a real-valued random process that changes value with every jump 
in the Poisson counting process. Because of the properties of the Poisson 
counting process, this process is such that jumps in nonoverlapping time 
windows are independent, the probability of a change of value in a differ- 
entially small interval is proportional to the length of the interval, and the 
probability of more than one change is negligible in comparison. It is usu- 
ally convenient to consider a slight change in this process, which makes it 
somewhat better behaved. Let $Z$ be a binary random variable, independent
of $N(t)$ and taking values of $+1$ or $-1$ with equal probability. Then
the random process $Y(t) = ZV(t)$ is called the random telegraph wave and
has long served as a fundamental example in the teaching of second-order 
random process theory. The purpose of the random variable Z is to remove 
an obvious nonstationarity at the origin and make the resulting process 
equally likely to have either of its two values at time zero. This has the 
obvious effect of making the process zero-mean. In the form given, it can 
be treated as simply a special case of exponential modulation.

We develop the second-order moments of exponentially modulated random
processes and then apply the results to the preceding examples. We
modify our definitions slightly to apply to complex-valued random processes.
The expectation of a complex-valued random variable
is defined as the vector consisting of the expectations of the real and
imaginary parts; that is, if $X = \mathrm{Re}(X) + j\,\mathrm{Im}(X)$, with $\mathrm{Re}(X)$ and $\mathrm{Im}(X)$
the real and imaginary parts of $X$, respectively, then

$$
EX = (E\,\mathrm{Re}(X), E\,\mathrm{Im}(X)) .
$$

In other words, the expectation of a vector is defined to be the vector 
of ordinary scalar expectations of the components. The autocorrelation 
function of a complex random process is defined somewhat differently as 

$$
R_Y(t, s) = E[Y(t)Y(s)^*] ,
$$

which reduces to the usual definition if the process is real valued. The 
autocorrelation in this more general situation is not in general symmetric, 
but it is Hermitian in the sense that 

$$
R_Y(s, t) = R_Y(t, s)^* .
$$

Being Hermitian is, in fact, the appropriate generalization of symmetry 
for developing a useful transform theory, and it is for this reason that the 
autocorrelation function includes the complex conjugate of the second term. 

It is an easy exercise to show that for the general exponentially modu- 
lated random process of (6.41) we have that 

$$
EY(t) = 0 .
$$

This can be accomplished by separately considering the real and imaginary 
parts and using (3.126), exactly as was done in the AM case of chapter 5. 
The use of the auxiliary random variable Z in the random telegraph wave 
definition means that both examples have zero mean. Note that it is not
true that $E e^{j(a_1 t + a_2 X(t) + \Theta)}$ equals $e^{j(a_1 t + a_2 EX(t) + E\Theta)}$; expectation
does not in general commute with nonlinear operations.

To find the autocorrelation of the exponentially modulated process, ob- 
serve that 



$$
E[Y(t)Y(s)^*] = a_0^2 E[e^{j(a_1 (t - s) + a_2 (X(t) - X(s)))}]
$$

and hence

$$
R_Y(t, s) = a_0^2 e^{j a_1 (t - s)} M_{X(t) - X(s)}(j a_2) . \tag{6.44}
$$

Thus the autocorrelation of the nonlinearly modulated process is given
simply in terms of the characteristic function of the increment between 
the two sample times! This is often a computable quantity, and when 
it is, we can find the second-order properties of such processes without 
approximation or linearization. This is a simple result of the fact that 
the autocorrelation of an exponentially modulated process is given by an 
expectation of the exponential of the difference of two samples and hence 
by the characteristic function of the difference. 

There are two examples in which the computation of the characteristic 
function of the difference of two samples of a random process is particularly
easy: a Gaussian input process and an independent increment input
process.

If the input process $\{X(t)\}$ is Gaussian with zero mean (for convenience)
and autocorrelation function $R_X(t, s)$, then the random variable $X(t) - X(s)$
is also Gaussian (being a linear combination of Gaussian random
variables) with mean zero and variance

$$
\sigma^2_{X(t) - X(s)} = E[(X(t) - X(s))^2] = R_X(t, t) + R_X(s, s) - 2 R_X(t, s) .
$$

Thus we have shown that if $\{X(t)\}$ is a zero-mean Gaussian random process
with autocorrelation function $R_X$ and if $\{Y(t)\}$ is obtained by exponentially
modulating $\{X(t)\}$ as in (6.41), then

$$
R_Y(t, s) = a_0^2 e^{j a_1 (t - s)} M_{X(t) - X(s)}(j a_2) = a_0^2 e^{j a_1 (t - s)} e^{-\frac{1}{2} a_2^2 (R_X(t, t) + R_X(s, s) - 2 R_X(t, s))} . \tag{6.45}
$$



Observe that this autocorrelation is not symmetric, but it is Hermitian. 

Thus, for example, if the input process is stationary, then so is the 
modulated process, and 



$$
R_Y(\tau) = a_0^2 e^{j a_1 \tau} e^{-a_2^2 (R_X(0) - R_X(\tau))} . \tag{6.46}
$$



We emphatically note that the modulated process is not Gaussian. 
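The Gaussian result can be checked by Monte Carlo. The sketch below draws a pair of zero-mean jointly Gaussian samples with common variance r0 and correlation r_ts, applies the exponential modulation with a shared uniform phase, and compares the sample average of Y(t)Y(s)* with the closed form. All numerical values and the helper name are assumptions made for illustration.

```python
import cmath
import math
import random

def modulated_product(a0, a1, a2, t, s, r0, r_ts, rng):
    """One sample of Y(t) Y(s)^* for a zero-mean Gaussian input with
    Var X(t) = Var X(s) = r0 and E[X(t)X(s)] = r_ts (assumed test values)."""
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    rho = r_ts / r0
    x_s = math.sqrt(r0) * g1
    x_t = math.sqrt(r0) * (rho * g1 + math.sqrt(1.0 - rho * rho) * g2)
    theta = rng.uniform(0.0, 2.0 * math.pi)      # random phase shared by both samples
    y_t = a0 * cmath.exp(1j * (a1 * t + a2 * x_t + theta))
    y_s = a0 * cmath.exp(1j * (a1 * s + a2 * x_s + theta))
    return y_t * y_s.conjugate()

rng = random.Random(0)
a0, a1, a2, t, s, r0, r_ts = 1.0, 2.0, 1.0, 0.8, 0.5, 1.0, 0.6
trials = 40000
estimate = sum(modulated_product(a0, a1, a2, t, s, r0, r_ts, rng) for _ in range(trials)) / trials
# Closed form with R_X(t,t) = R_X(s,s) = r0 and R_X(t,s) = r_ts.
predicted = a0 ** 2 * cmath.exp(1j * a1 * (t - s)) * math.exp(-a2 ** 2 * (r0 - r_ts))
print(f"Monte Carlo {estimate:.4f}, closed form {predicted:.4f}")
```

Note that the phase cancels in the product, so the estimate depends only on the increment X(t) - X(s), as the derivation indicates.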

We can use this result to obtain the second-order properties for phase 
modulation as follows: 



$$
\begin{aligned}
R_U(t, s) &= E[U(t)U(s)^*] = E\left[\frac{Y(t) + Y(t)^*}{2} \, \frac{Y(s) + Y(s)^*}{2}\right] \\
&= \frac{1}{4}\left(E[Y(t)Y(s)^*] + E[Y(t)Y(s)] + E[Y(t)^*Y(s)^*] + E[Y(t)^*Y(s)]\right) .
\end{aligned}
$$

Note that both of the middle terms on the right have the form

$$
a_0^2 E[e^{\pm j(a_1 (t + s) + a_2 (X(t) + X(s)) + 2\Theta)}] ,
$$

which evaluates to 0 because of the uniform phase angle. The remaining
terms are $R_Y(t, s)$ and $R_Y(t, s)^*$ from the previous development, and hence

$$
R_U(t, s) = \tfrac{1}{2} a_0^2 \cos(a_1 (t - s)) \, e^{-\frac{1}{2} a_2^2 (R_X(t, t) + R_X(s, s) - 2 R_X(t, s))} , \tag{6.47}
$$

and hence, in the stationary case,

$$
R_U(\tau) = \tfrac{1}{2} a_0^2 \cos(a_1 \tau) \, e^{-a_2^2 (R_X(0) - R_X(\tau))} . \tag{6.48}
$$

As expected, this autocorrelation is symmetric. 

Returning to the exponential modulation case, we consider the second 
example of exponential modulation of independent increment processes. 
Observe that this overlaps the preceding example in the case of the Wiener 
process. We also note that phase modulation by independent increment 
processes is of additional interest because in some examples independent 
increment processes can be modeled as the integral of another process. For 
example, the Poisson counting process is the integral of a random telegraph 
wave with alphabet 0 and 1 instead of —1 and +1. (This is accomplished 
by forming the process $(X(t) + 1)/2$ with $X(t)$ the $\pm 1$ random telegraph
wave.) In this case the real part of the output process is the FM modulation
of the process being integrated. 

If $\{X(t)\}$ is a random process with independent and stationary increments,
then the characteristic function of $X(t) - X(s)$ with $t > s$ is equal
to that of $X(t - s)$. Thus we have from (6.44) that for $t > s$ and $\tau = t - s$,

$$
R_Y(\tau) = a_0^2 e^{j a_1 \tau} M_{X(\tau)}(j a_2) .
$$

We can repeat this development for the case of negative lag to obtain

$$
R_Y(\tau) = a_0^2 e^{j a_1 \tau} M_{X(|\tau|)}(j a_2) . \tag{6.49}
$$

Observe that this autocorrelation is not symmetric; that is, it is not true
that $R_Y(-\tau) = R_Y(\tau)$ (unless $a_1 = 0$). It is, however, Hermitian.

Equation (6.49) provides an interesting oddity: Even though the original 
input process is not weakly stationary (since it is an independent increment 
process), the exponentially modulated output is weakly stationary! For 
example, if $\{X(t)\}$ is a Poisson counting process with parameter $\lambda$, then
the characteristic function is

$$
M_{X(\tau)}(ju) = \sum_{k=0}^{\infty} \frac{(\lambda \tau)^k e^{-\lambda \tau}}{k!} e^{juk} = e^{\lambda \tau (e^{ju} - 1)} .
$$

Thus if we choose $a_1 = 0$ and $a_2 = \pi$, then the modulated output process is
the random telegraph wave with alphabet $\pm a_0$ and hence is a real process.
Equation (6.49) becomes

$$
R_Y(\tau) = a_0^2 e^{-2 \lambda |\tau|} . \tag{6.50}
$$

Note that the autocorrelation (and hence also the covariance) decays expo- 
nentially with the delay. 
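The exponential decay of the random telegraph wave autocorrelation can be seen in a short simulation. The sketch below samples the wave at two times by drawing the Poisson count up to the first time and an independent increment over the lag, then averages the product; the rate, times, and trial count are assumed example values.

```python
import math
import random

def poisson_count(rate, duration, rng):
    """Number of arrivals in (0, duration], built from exponential interarrival times."""
    count, clock = 0, rng.expovariate(rate)
    while clock <= duration:
        count += 1
        clock += rng.expovariate(rate)
    return count

def telegraph_product(a0, rate, t, tau, rng):
    """One sample of Y(t) Y(t + tau) for the random telegraph wave Y = Z a0 (-1)^N."""
    z = rng.choice((-1, 1))                               # random initial sign
    n_t = poisson_count(rate, t, rng)
    n_t_tau = n_t + poisson_count(rate, tau, rng)         # independent, stationary increment
    return (z * a0 * (-1) ** n_t) * (z * a0 * (-1) ** n_t_tau)

rng = random.Random(0)
a0, rate, t, tau, trials = 1.0, 1.5, 1.0, 0.4, 40000      # assumed example values
estimate = sum(telegraph_product(a0, rate, t, tau, rng) for _ in range(trials)) / trials
predicted = a0 ** 2 * math.exp(-2.0 * rate * tau)
print(f"sample autocorrelation {estimate:.4f}, a0^2 exp(-2*lambda*tau) = {predicted:.4f}")
```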

A complete specification of the random telegraph wave is possible and 
is left as an exercise. 

6.12 Thermal Noise

Thermal noise is one of the most important sources of noise in communi- 
cations systems. It is the “front-end” noise in receivers that is caused by 
the random motion of electrons in a resistance. The resulting noise is then 
greatly amplified by the amplifiers that magnify the noise along with the 
possibly tiny signals. Thus the noise is really in the receiver itself and not 
in the atmosphere, as some might think, and can be comparable in ampli- 
tude to the desired signal. In this section we sketch the development of a 
model of thermal noise. The development provides an interesting example 
of a process with both Poisson and Gaussian characteristics. 

Say we have a uniform conducting cylindrical rod at temperature T. 
Across this rod we connect an ammeter. The random motion of electrons 
in the rod will cause a current I{t) to flow through the meter. We wish to 
develop a random process model for the current based on the underlying 
physics. The following are the relevant physical parameters: 

$A$ = cross-sectional area of the rod
$L$ = length of the rod
$q$ = electron charge
$n$ = number of electrons per cubic centimeter
$\alpha$ = average number of electron collisions with heavier particles per second (about 10^)
$m$ = mass of an electron
$\rho$ = resistivity of the rod = $m\alpha/(nq^2)$
$R$ = resistance of the rod = $\rho L / A$
$\kappa$ = Boltzmann's constant

The current measured will be due to electrons moving in the longitudinal
direction of the rod, which we denote $x$. Let $V_{x,k}(t)$ denote the component
of velocity in the $x$ direction of the $k$th electron at time $t$. The total current
$I(t)$ is then given by the sum of the individual electron currents as

$$
I(t) = \sum_{k=1}^{nAL} i_k(t) = \sum_{k=1}^{nAL} \frac{q}{L / V_{x,k}(t)} = \sum_{k=1}^{nAL} \frac{q}{L} V_{x,k}(t) .
$$



We assume that (1) the average velocity $EV_{x,k}(t) = 0$ for all $k$, $t$; (2)
$V_{x,k}(t)$ and $V_{x,j}(s)$ are independent random variables for all $k \ne j$; and (3)
the $V_{x,k}(t)$ have the same distribution for all $k$.

The autocorrelation function of I{t) is found as 



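$$
\begin{aligned}
R_I(\tau) = E[I(t)I(t + \tau)] &= \sum_{k=1}^{nAL} \frac{q^2}{L^2} E[V_{x,k}(t)V_{x,k}(t + \tau)] \\
&= \frac{nAq^2}{L} E[V_x(t)V_x(t + \tau)] ,
\end{aligned}
\tag{6.51}
$$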

where we have dropped the subscript k since by assumption the distribution, 
and hence the autocorrelation function of the velocity, does not depend on 
it. 

Next assume that, since collisions are almost always with heavier parti- 
cles, the electron velocities before and after collisions are independent — the 
velocity after impact depends only on the momentum of the heavy particle 
that the electron hits. We further assume that the numbers of collisions in 
disjoint time intervals are independent and satisfy (6.33) with a change of 
parameter: 

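$$
\begin{aligned}
\Pr(\text{no collisions in } \Delta t) &\cong 1 - \alpha \Delta t \\
\Pr(\text{one collision in } \Delta t) &\cong \alpha \Delta t .
\end{aligned}
$$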

This implies that the number of collisions is Poisson and that from (6.35) 



$$
\Pr(\text{a particle has } k \text{ collisions in } [t, t + \tau)) = \frac{e^{-\alpha \tau} (\alpha \tau)^k}{k!} ; \quad k = 0, 1, 2, \ldots
$$



Thus if $\tau > 0$ and $N_{t,\tau}$ is the number of collisions in $[t, t + \tau)$, then, using
iterated expectation and the independence with mean zero of electron
velocities when one or more collisions have occurred,

$$
\begin{aligned}
E[V_x(t)V_x(t + \tau)] &= E(E[V_x(t)V_x(t + \tau) \mid N_{t,\tau}]) \\
&= E(V_x(t)^2)\Pr(N_{t,\tau} = 0) + (EV_x(t))^2 \Pr(N_{t,\tau} \ne 0) \\
&= E(V_x(t)^2) e^{-\alpha \tau} .
\end{aligned}
\tag{6.52}
$$

It follows from the equipartition theorem for electrons in thermal equilibrium
at temperature $T$ that the electron velocity variance is

$$
E(V_x(t)^2) = \frac{\kappa T}{m} . \tag{6.53}
$$

Therefore, after some algebra, we have from (6.51) through (6.53) that

$$
R_I(\tau) = \frac{\kappa T}{R} \alpha e^{-\alpha |\tau|} .
$$

Thévenin's theorem can then be applied to model the conductor as a
voltage source with voltage $E(t) = RI(t)$. The autocorrelation function of
$E(t)$ is

$$
R_E(\tau) = \kappa T R \alpha e^{-\alpha |\tau|} ,
$$

an autocorrelation function that decreases exponentially with the delay $\tau$.
Observe that as $\alpha \to \infty$, $R_E(\tau)$ becomes a taller and narrower pulse with
constant area $2\kappa TR$; that is, it looks more and more like a Dirac delta
function with area $2\kappa TR$. Since the mean is zero, this implies that the
process $E(t)$ is such that samples separated by very small amounts are
approximately uncorrelated. Thus thermal noise is approximately white 
noise. The central limit theorem can be used to show that the finite dimen- 
sional distributions of the process are approximately Gaussian. Thus we 
can conclude that an approximate model for thermal noise is a Gaussian 
white noise process! 
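The "taller and narrower pulse with constant area" behavior can be illustrated numerically. The sketch below assumes the exponential form kappa*T*R*alpha*exp(-alpha*|tau|) for the voltage autocorrelation, which is the form consistent with the stated area of 2*kappa*T*R, and integrates it with a midpoint rule for several collision rates. The temperature, resistance, and collision rates are assumed example values.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann's constant in J/K

def voltage_autocorrelation(tau, alpha, temp, resistance):
    """Assumed form kappa*T*R*alpha*exp(-alpha*|tau|) of the voltage autocorrelation."""
    return BOLTZMANN * temp * resistance * alpha * math.exp(-alpha * abs(tau))

def area_under_autocorrelation(alpha, temp, resistance, decay_times=40.0, steps=20000):
    """Midpoint-rule integral over [-W, W] with W = decay_times / alpha."""
    width = decay_times / alpha
    d_tau = 2.0 * width / steps
    total = sum(
        voltage_autocorrelation(-width + (i + 0.5) * d_tau, alpha, temp, resistance)
        for i in range(steps)
    )
    return total * d_tau

temp, resistance = 290.0, 50.0            # assumed example values (kelvin, ohms)
ratios = {}
for alpha in (1e6, 1e9, 1e12):            # assumed collision rates per second
    peak = voltage_autocorrelation(0.0, alpha, temp, resistance)
    area = area_under_autocorrelation(alpha, temp, resistance)
    ratios[alpha] = area / (2.0 * BOLTZMANN * temp * resistance)
    print(f"alpha = {alpha:.0e}: peak = {peak:.3e}, area / (2*kappa*T*R) = {ratios[alpha]:.6f}")
```

The peak grows in proportion to the collision rate while the area ratio stays at 1, which is the numerical counterpart of the delta-function limit described above.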



6.13 Ergodicity and Strong Laws of Large Numbers

We close this chapter on general random processes with memory with a 
statement of a general form of the strong law of large numbers. In order 
to state the theorem, another idea is needed — ergodicity. The notion 
of ergodicity is often described incorrectly in engineering-oriented texts on 
random processes. There is, however, some justification for doing so: the
definition is extremely abstract and not very intuitive. The intuition comes
with the consequences of assuming both ergodicity and stationarity, and it 
is these consequences that are often used as a definition. For completeness 
we provide the rigorous definition. We later consider briefly examples of 
processes that violate the definition. Before possibly obscuring the key 
issues with abstraction, it is worth pointing out a few basic facts: 

• The concept of ergodicity does not require stationarity, that is, a 
nonstationary process can be ergodic.

• Many perfectly good models of physical processes are not ergodic, yet
they have a form of law of large numbers. In other words, nonergodic 
processes can be perfectly good and useful models. 

• The definition is in terms of the process distribution of the random 
process. There is no finite-dimensional equivalent definition of ergod- 
icity as there is for stationarity. This fact makes it more difficult to 
describe and interpret ergodicity. 

• iid processes are ergodic, i.e., ergodicity can be thought of as a gen- 
eralization of iid. 

Ergodicity is defined in terms of a property of events: an event $F$ is said
to be $\tau$-invariant if $\{x_t;\ t \in \mathcal{T}\} \in F$ implies that also $\{x_{t+\tau};\ t \in \mathcal{T}\} \in F$,
i.e., if a sequence or waveform is in $F$, then so is the sequence or waveform
formed by shifting by $\tau$. As an example, consider the discrete time random
process event $F$ consisting of all binary sequences having a limiting relative
frequency of 1's of exactly $p$. Then this event is $\tau$-invariant for all integer
$\tau$ since changing the starting time of the sequence by a finite amount does
not affect limiting relative frequencies.

A random process $\{X_t;\ t \in \mathcal{T}\}$ is ergodic if for any $\tau$ all $\tau$-invariant
events $F$ have probability 1 or 0. In the discrete time case it suffices to
consider only $\tau = 1$.

In the authors’ view, the concept of ergodicity is the most abstract idea 
of this book, but its importance in practice makes it imperative that the 
idea at least be introduced and discussed. The reader interested in delving 
more deeply into the concept is referred to Billingsley’s classic book Ergodic 
Theory and Information[iS\ for a deep look at ergodicity and its implications 
for discrete time discrete alphabet random processes. Rather than try to 
provide further insight into the abstract definition, we instead turn to its 
implications, and then interpret from the implications what it means for a 
process to be ergodic or not. 

The importance of stationarity and ergodicity is largely due to the fol- 
lowing classic result of Birkhoff and Khinchine. 

Theorem 6.1 The Strong Law of Large Numbers (The Pointwise Ergodic 
Theorem) 

Given a discrete time stationary random process $\{X_n;\ n \in \mathcal{Z}\}$ with
finite expectation $E(X_n) = m_X$, then there is a random variable $\hat{X}$ with
the property that

$$
\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i = \hat{X} \quad \text{with probability 1,} \tag{6.54}
$$

that is, the limit exists. If the process is also ergodic, then $\hat{X} = m_X$ and
hence

$$
\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i = m_X \quad \text{with probability 1.} \tag{6.55}
$$

The conditions also imply convergence in mean square (an $L_2$ or mean
ergodic theorem); that is,

$$
\mathop{\mathrm{l.i.m.}}_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} X_i = \hat{X} , \tag{6.56}
$$

but we shall focus on the convergence with probability 1 form. There are
also continuous time versions of the theorem to the effect that under suitable
conditions

$$
\lim_{T \to \infty} \frac{1}{T} \int_0^T X(t)\, dt = \hat{X} \quad \text{with probability 1,} \tag{6.57}
$$

but these are much more complicated to describe because special conditions 
are needed to ensure the existence of the time average integrals. 

The strong law of large numbers shows that for stationary and ergodic 
processes, time averages converge with probability one to the corresponding 
expectation. Suppose that a process is stationary but not ergodic. Then 
the theorem is that time averages still converge, but possibly not to the 
expectation. Consider the following example of a random process which 
exhibits this behavior. Suppose that nature at the beginning of time flips a 
fair coin. If the coin ends up heads, she sends thereafter a Bernoulli process
with parameter $p_1$, that is, an iid sequence of coin flips with a probability $p_1$
of getting a head. If the original coin comes up tails, however, nature sends
thereafter a Bernoulli process with parameter $p_0 \ne p_1$. In other words, you
the observer are looking at the output of one of two iid processes, but you do
not know which one. This is an example of a mixture random process, also
sometimes called a doubly stochastic random process because of the random
selection of a parameter followed by the random generation of a sequence
using that parameter. Another way to view the process is as follows: Let
$\{U_n\}$ denote the Bernoulli process with parameter $p_1$ and $\{W_n\}$ denote the
Bernoulli process with parameter $p_0$. Then the mixture process $\{X_n\}$ is
formed by connecting a switch at the beginning of time to either the $\{U_n\}$
process or the $\{W_n\}$ process, and soldering the switch shut. The point is
you either see $\{U_n\}$ forever with probability 1/2, or you see $\{W_n\}$ forever.
A little elementary conditional probability shows that for any dimension k, 

$$
p_{X_0, \ldots, X_{k-1}}(x) = \frac{p_{U_0, \ldots, U_{k-1}}(x) + p_{W_0, \ldots, W_{k-1}}(x)}{2} . \tag{6.58}
$$

Thus, for example, the probability of getting a head in the mixture process
is $p_{X_0}(1) = (p_0 + p_1)/2$. Similarly, the probability of getting two heads
in a row is $p_{X_0, X_1}(1, 1) = (p_0^2 + p_1^2)/2$. Since the joint pmf's for the two
Bernoulli processes are not changed by shifting, neither is the joint pmf
for the mixture process. Hence the mixture process is stationary and from
the strong law of large numbers its relative frequencies will converge to
something. Is the mixture process ergodic? It is certainly not iid. For
example, the probability of getting two heads in a row was found to be
$p_{X_0, X_1}(1, 1) = (p_0^2 + p_1^2)/2$, which is not the same as $p_{X_0}(1)p_{X_1}(1) = [(p_0 + p_1)/2]^2$
(unless $p_0 = p_1$), so that $X_0$ and $X_1$ are not independent! It could
conceivably be ergodic, but is it? Suppose that $\{X_n\}$ were indeed ergodic;
then the strong law would say that the relative frequency of heads would
have to converge to the probability of a head, i.e., to $(p_0 + p_1)/2$. But
this is clearly not true since if you observe the outputs of $\{X_n\}$ you are
observing a Bernoulli process of bias either $p_0$ or $p_1$ and hence you should
expect to compute a limiting relative frequency of heads that is either $p_0$
or $p_1$, depending on which of the Bernoulli processes you are looking at. In
other words, your limiting relative frequency is a random variable, which 
depends on Nature’s original choice of which process to let you observe. 
This explains one possible behavior leading to the general strong law: you 
observe a mixture of stationary and ergodic processes, that is, you observe 
a randomly selected stationary and ergodic process, but you do not a priori 
know which process it is. Since conditioned on this selection the strong law 
holds, relative frequencies will converge, but they do not converge to an 
overall expectation. They converge to a random variable, which is in fact 
just the conditional expectation given knowledge of which stationary and 
ergodic random process is actually being observed! Thus the strong law of 
large numbers can be quite useful in such a stationary but nonergodic case 
since one can estimate which stationary ergodic process is actually being 
observed by measuring the relative frequencies. 
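The behavior of the mixture process is easy to see in simulation. In the sketch below, nature's coin is flipped once per run and a long Bernoulli sequence is then generated with the selected bias; the relative frequency of heads settles near the selected bias rather than near the overall probability of a head. The biases, run length, and function name are assumptions chosen for illustration.

```python
import random

def mixture_relative_frequency(p0, p1, n, rng):
    """Flip a fair coin once to select the bias, then emit n iid Bernoulli samples
    with that bias; return the selected bias and the relative frequency of heads."""
    bias = p1 if rng.random() < 0.5 else p0
    heads = sum(1 for _ in range(n) if rng.random() < bias)
    return bias, heads / n

rng = random.Random(1)
p0, p1 = 0.2, 0.7                         # assumed example biases
runs = [mixture_relative_frequency(p0, p1, 20000, rng) for _ in range(8)]
for bias, freq in runs:
    print(f"selected bias {bias:.1f}: relative frequency {freq:.4f} "
          f"(overall head probability {(p0 + p1) / 2:.2f})")
```

Each run estimates the bias that was actually selected, which is the sense in which the limiting relative frequency is a random variable.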

A perhaps surprising fundamental result of random processes is that this 
special example is in a sense typical of all stationary nonergodic processes. 
The result is called the ergodic decomposition theorem and it states that 
under quite general assumptions, any nonergodic stationary process is in 
fact a mixture of stationary and ergodic processes and hence you are always 
observing a stationary and ergodic process, you just do not know in advance 
which one. In our coin example, you know you are observing one of two 
Bernoulli processes, but we could equally consider an infinite mixture by 
selecting p from a uniform distribution on (0,1). You do not know p in 
advance, but you can estimate it from relative frequencies. The interested 
reader can find a development of the ergodic decomposition theorem and 
its history in chapter 7 of [22]. 




The previous discussion implies that ergodicity is not required for the 
strong law of large numbers to be useful. The next question is whether or 
not stationarity is required. Again the answer is no. Given that the main 
concern is the convergence of sample averages and relative frequencies, it 
should be reasonable to expect that random processes could exhibit tran- 
sient or short term behavior that violated the stationarity definition, yet 
eventually dies out so that if one waited long enough the process would 
look increasingly stationary. In fact one can make precise the notion of
asymptotically stationary (in several possible ways) and the strong law ex- 
tends to this case. Again the interested reader is referred to chapter 7 of 
[22]. The point is that the notions of stationarity and ergodicity should 
not be taken too seriously since ergodicity can easily be dispensed with and 
stationarity can be significantly weakened and still have processes for which 
laws of large numbers hold so that time averages and relative frequencies 
have well defined limits. 
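
As a rough illustration of a transient that dies out, the sketch below uses a hypothetical process X_k = c r^k + Z_k, where Z_k is iid uniform on [-1, 1]. The process, the constants `c` and `r`, and the function name `sample_average` are assumptions chosen for illustration. Its mean c r^k changes with time, so it is not stationary, yet its sample average still approaches the limiting mean of 0.

```python
import random

def sample_average(n, rng, c=10.0, r=0.9):
    """Sample average of X_k = c * r**k + Z_k for k = 0..n-1, Z_k iid uniform on [-1, 1]."""
    total = 0.0
    transient = c  # deterministic transient term c * r**k, starting at k = 0
    for _ in range(n):
        total += transient + rng.uniform(-1.0, 1.0)
        transient *= r
    return total / n

rng = random.Random(1)  # fixed seed so the run is repeatable
short_run = sample_average(10, rng)       # still dominated by the transient
long_run = sample_average(100000, rng)    # transient has been averaged away
print(short_run, long_run)
```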



6.14 Problems 



1. Let $\{X_n\}$ be an iid process with a Poisson marginal pmf with parameter
$\lambda$. Let $\{Y_n\}$ denote the induced sum process as in equation (6.6).
Find the pmf for $Y_n$ and find $\sigma^2_{Y_n}$, $EY_n$, and $K_Y(t,s)$.

2. Let $\{X_n\}$ be an iid process. Define a new process $\{U_n\}$ by

Un = Xn Xn—l ^ U = 1, 2, . . . . 



Find the characteristic function and the pmf for $U_n$. Find $R_U(t,s)$.
Is $\{U_n\}$ an independent increment process?

3. Let $\{X_n\}$ be a ternary iid process with $p_{X_n}(+1) = p_{X_n}(-1) = \epsilon/2$
and $p_{X_n}(0) = 1 - \epsilon$. Fix an integer $N$ and define the “sliding average”

$$Y_n = \frac{1}{N} \sum_{i=0}^{N-1} X_{n-i} .$$

(a) Find $EX_n$, $\sigma^2_{X_n}$, $M_{X_n}(ju)$, and $K_X(t,s)$.

(b) Find $EY_n$, $\sigma^2_{Y_n}$, $M_{Y_n}(ju)$.

(c) Find the cross-correlation $R_{X,Y}(t,s) = E[X_t Y_s]$.

(d) Given $\delta > 0$ find a simple upper bound to $\Pr(|Y_n| > \delta)$ in terms
of $N$ and $\epsilon$.



4. Find the characteristic function $M_{U_n}(ju)$ for the $\{U_n\}$ process of
exercise 5.2.

5. Find a complete specification of the binary autoregressive process of
exercise 5.11. Prove that the process is Markov. (One name for this 
process is the binary symmetric Markov source.) 

6. A stationary continuous time random process $\{X(t)\}$ switches randomly
between the values of 0 and 1. We have that

$$\Pr(X(t) = 1) = \Pr(X(t) = 0) = \frac{1}{2} ,$$

and if $N_t$ is the number of changes of output during $(0,t]$, then

$$p_{N_t}(k) = \frac{1}{1 + \alpha t} \left( \frac{\alpha t}{1 + \alpha t} \right)^k ; \quad k = 0, 1, 2, \ldots ,$$

where $\alpha > 0$ is a fixed parameter. (This is called the Bose-Einstein
distribution.)

(a) Find $M_{N_t}(ju)$, $EN_t$, and $\sigma^2_{N_t}$.

(b) Find $EX(t)$ and $R_X(t,s)$.

7. Given two random processes $\{X_t\}$, called the signal process, and
$\{N_t\}$, called the noise process, define the process $\{Y_t\}$ by

$$Y_t = X_t + N_t .$$

The $\{Y_t\}$ process can be considered as the output of a channel with
additive noise where the $\{X_t\}$ process is the input. This is a common
model for dealing with noisy linear communication systems; e.g., the
noise may be due to atmospheric effects or to front-end noise in a
receiver. Assume that the signal and noise processes are independent;
that is, any vector of samples of the $X$ process is independent of any
vector of samples of the $N$ process. Find the characteristic
function, mean, and variance of $Y_t$ in terms of those for $X_t$ and $N_t$.
Find the covariance of the output process in terms of the covariances
of the input and noise process.



8. Find the inverse of the covariance matrix of the discrete time Wiener
process, that is, the inverse of the matrix
$\{\min(k,j);\ k = 1, 2, \ldots, n,\ j = 1, 2, \ldots, n\}$.



9. Let $\{X(t)\}$ be a Gaussian random process with zero mean and autocorrelation
function

Rx{r) = . 



Is the process Markov? Find its power spectral density. Let $Y(t)$ be
the process formed by DSB-SC modulation of $X(t)$ as in (5.37) with
$a_0 = 0$. If the phase angle $\Theta$ is assumed to be 0, is the resulting
modulated process Gaussian? Letting $\Theta$ be uniformly distributed,
sketch the power spectral density of the modulated process. Find
$M_{Y(0)}(ju)$.

10. Let $\{X(t)\}$ and $\{Y(t)\}$ be the two continuous time random processes
of exercise 5.14 and let

$$W(t) = X(t) \cos(2\pi f_0 t) + Y(t) \sin(2\pi f_0 t) ,$$

as in that exercise. Find the marginal probability density function
$f_{W(t)}$ and the joint pdf $f_{W(t),W(s)}(u,v)$. Is $\{W(t)\}$ a Gaussian
process? Is it strictly stationary?

11. Let $\{N_k\}$ be the binomial counting process and define the discrete
time random process $\{Y_n\}$ by

$$Y_n = (-1)^{N_n} .$$

(This is the discrete time analog to the random telegraph wave.) Find 
the autocorrelation, mean, and power spectral density of the given 
process. Is the process Markov? 

12. Find the power spectral density of the random telegraph wave. Is 
this process a Markov process? Sketch the spectrum of an amplitude 
modulated random telegraph wave. 

13. Suppose that $(U, W)$ is a Gaussian random vector with $EU = EW = 0$,
$E(U^2) = E(W^2) = \sigma^2$, and $E(UW) = \rho\sigma^2$. (The parameter $\rho$
has magnitude less than or equal to 1 and is called the correlation
coefficient.) Define the new random variables

$$S = U + W$$

$$D = U - W$$

(a) Find the marginal pdf’s for $S$ and $D$.

(b) Find the joint pdf $f_{S,D}(\alpha, \beta)$ or the joint characteristic function
$M_{S,D}(ju, jv)$. Are $S$ and $D$ independent?

14. Suppose that $K$ is a random variable with a Poisson distribution, that
is, for a fixed parameter $\lambda$

$$\Pr(K = k) = p_K(k) = \frac{\lambda^k e^{-\lambda}}{k!} ; \quad k = 0, 1, 2, \ldots$$

(a) Define a new random variable $N$ by $N = K + 1$. Find the characteristic
function $M_N(ju)$, the expectation $EN$, and the pmf $p_N(n)$ for the random
variable $N$.

We define a one-sided discrete time random process $\{Y_n;\ n = 1, 2, \ldots\}$
as follows: $Y_n$ has a binary alphabet $\{-1, 1\}$. $Y_0$ is
equally likely to be $-1$ or $+1$. Given $Y_0$ has some value, it will
stay at that value for a total of $T_1$ time units, where $T_1$ has the same
distribution as $N$, and then it will change sign. It will stay at the new
sign for a total of $T_2$ time units, where $T_2$ has the same distribution as
$N$ and is independent of $T_1$, and then change sign again. It will
continue in this way, that is, it will change sign for the $k$th time
at time

$$S_k = \sum_{i=1}^{k} T_i ,$$

where the $T_i$ form an iid sequence with the marginal distribution
found in part (a).

(b) Find the characteristic function $M_{S_k}(ju)$ and the pmf $p_{S_k}(m)$
for the random variable $S_k$. Is $\{S_k\}$ an independent increment
process?

15. Suppose that $\{Z_n\}$ is a two-sided Bernoulli process, that is, an iid
sequence of binary $\{0, 1\}$ random variables with $\Pr(Z_n = 1) = \Pr(Z_n = 0)$.
Define the new processes

$$X_n = (-1)^{Z_n} ,$$



= ; n = 0,l,2,... , 

i^O 

and

$$V_n = \sum_{i=0}^{\infty} 2^{-i} X_{n-i} ; \quad n \in \mathcal{Z} .$$

(a) Find the means and autocorrelation functions of the $\{X_n\}$ process
and the $\{V_n\}$ process. If possible, find the power spectral
densities.

(b) Find the characteristic functions for both $Y_n$ and $V_n$.

(c) Is $\{Y_n\}$ an autoregressive process? a moving average process? Is
it weakly stationary? Is $V_n$ an autoregressive process? a moving
average process? Is it weakly stationary? (Note: answers to
parts (a) and (b) are sufficient to answer the stationarity questions,
no further computations are necessary.)

(d) Find the conditional pmf

$$p_{V_n | V_{n-1}, V_{n-2}, \ldots, V_0}(v_n | v_{n-1}, \ldots, v_0) .$$

Is $\{V_n\}$ a Markov process?

16. Suppose that $\{Z_n\}$ and $\{W_n\}$ are two mutually independent two-sided
zero mean iid Gaussian processes with variances $\sigma_Z^2$ and $\sigma_W^2$,
respectively. $Z_n$ is put into a linear time-invariant filter to form an
output process $\{X_n\}$ defined by

$$X_n = Z_n - r Z_{n-1} ,$$

where $0 < r < 1$. (Such a filter is sometimes called a preemphasis
filter in speech processing.) This process is then used to form a new
process

$$Y_n = X_n + W_n ,$$

which can be viewed as a noisy version of the preemphasized $Z_n$
process. Lastly, the $Y_n$ process is put through a “deemphasis filter”
to form an output process $U_n$ defined by

$$U_n = r U_{n-1} + Y_n .$$

(a) Find the autocorrelation $R_Z$ and the power spectral density $S_Z$.
Recall that for a weakly stationary discrete time process with
zero mean $R_Z(k) = E(Z_n Z_{n+k})$ and

$$S_Z(f) = \sum_{k=-\infty}^{\infty} R_Z(k) e^{-j 2\pi f k} ,$$

the discrete time Fourier transform of $R_Z$.

(b) Find the autocorrelation $R_X$ and the power spectral density $S_X$.

(c) Find the autocorrelation $R_Y$ and the power spectral density $S_Y$.

(d) Find the conditional pdf $f_{Y_n|X_n}(y|x)$.

(e) Find the pdf $f_{U_n,W_n}$ (or the corresponding characteristic function
$M_{U_n,W_n}(ju, jv)$).

(f) Find the overall mean squared error $E[(U_n - Z_n)^2]$.

17. Suppose that $\{N_t\}$ is a binomial counting process and that $\{X_n\}$ is
an iid process that is mutually independent of $\{N_t\}$. Assume that the
$X_n$ have zero mean and variance $\sigma^2$. Let $Y_k$ denote the compound
process

$$Y_k = \sum_{i=1}^{N_k} X_i .$$

Use iterated expectation to evaluate the autocorrelation function $R_Y(t,s)$.

18. Suppose that $\{W_n\}$ is a discrete time Wiener process. What is
the minimum mean squared estimate of $W_n$ given $W_{n-1}, W_{n-2}, \ldots$?
What is the linear least squares estimator?

19. Let $\{X_n\}$ be an iid binary random process with $\Pr(X_n = \pm 1) = 1/2$
and let $\{N_t\}$ be a Poisson counting process. A continuous time
random walk $Y(t)$ can be defined by

$$Y_t = \sum_{i=1}^{N_t} X_i ; \quad t > 0 .$$

Find the expectation, covariance, and characteristic function of $Y_t$.

20. Are compound processes independent increment processes? 

21. Suppose that $\{N_t;\ t \geq 0\}$ is a process with independent and
stationary increments and that

$$p_{N_t}(k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} ; \quad k = 0, 1, 2, \ldots .$$

Suppose also that $\{L_t;\ t \geq 0\}$ is a process with independent and
stationary increments and that

Assume that the two processes $N_t$ and $L_t$ are mutually independent
of each other and define for each $t$ the random variable $I_t = N_t + L_t$.
$I_t$ might model, for example, the number of requests for cpu
cycles arriving from two independent sources, each of which produces
requests according to a Poisson process.

(a) What is the characteristic function for $I_t$? What is the corresponding
pmf?

(b) Find the mean and covariance function of $\{I_t\}$.

(c) Is $\{I_t;\ t \geq 0\}$ an independent increment process?

(d) Suppose that Z is a discrete random variable, independent of 
Nt, with probability mass function 

ct^ 

Pz{k)= = 

as in the first problem. Find the probability $P(Z = N_t)$.

(e) Suppose that $\{Z_n\}$ is an iid process with marginal pmf $p_Z(k)$ as
in the previous part. Define the compound process

$$Y_t = \sum_{k=0}^{N_t} Z_k .$$

Find the mean $E(Y_t)$ and variance $\sigma^2_{Y_t}$.

22. Suppose that $\{X_n;\ n \in \mathcal{Z}\}$ is a discrete time iid Gaussian random
process with 0 mean and variance $\sigma_X^2 = E[X_0^2]$. We consider this
an input signal to a signal processing system. Suppose also that
$\{W_n;\ n \in \mathcal{Z}\}$ is a discrete time iid Gaussian random process with
0 mean and variance $\sigma_W^2$ and that the two processes are mutually
independent. $W_n$ is considered to be noise. Suppose that $X_n$ is put
into a linear filter with unit pulse response $h$, where

$$h_k = \begin{cases} 1 & k = 0 \\ -1 & k = -1 \\ 0 & \text{otherwise} \end{cases}$$

to form an output $U = X * h$, the convolution of the input signal and
the unit pulse response. The final output signal is then formed by
adding the noise to the filtered input signal, $Y_n = U_n + W_n$.

(a) Find the mean, power spectral density, and marginal pdf for $U_n$.

(b) Find the joint pdf (a, /?). You can leave your answer in 

terms of an inverse matrix $\Lambda^{-1}$, but you must accurately describe
$\Lambda$.

(c) Find the mean, covariance, and power spectral density for $Y_n$.

(d) Find $E[Y_n X_n]$.

(e) Does the mean ergodic theorem hold for $\{Y_n\}$?







CHAPTER 6. A MENAGERIE OF PROCESSES 



23. Suppose that $\{X(t);\ t \in \Re\}$ is a weakly stationary continuous time
Gaussian random process with 0 mean and autocorrelation function

$$R_X(\tau) = E[X(t)X(t+\tau)] = \sigma_X^2 e^{-|\tau|} .$$

(a) Define the random process $\{Y(t);\ t \in \Re\}$ by

$$Y(t) = \int_{t-T}^{t} X(\alpha) \, d\alpha ,$$

where $T > 0$ is a fixed parameter. (This is a short term integrator.)
Find the mean and power spectral density of $\{Y(t)\}$.

(b) For fixed $t > s$, find the characteristic function and the pdf for
the random variable $X(t) - X(s)$.

24. Consider the process $\{X_k;\ k = 0, 1, \ldots\}$ defined by $X_0 = 0$ and

$$X_{k+1} = a X_k + W_k , \quad k \geq 0 \qquad (6.59)$$

where $a$ is a constant, $\{W_k;\ k = 0, 1, \ldots\}$ is a sequence of iid Gaussian
random variables with $E(W_k) = 0$ and $E(W_k^2) = \sigma^2$.

(a) Calculate $E(X_k)$ for $k \geq 0$.

(b) Show that $X_k$ and $W_k$ are uncorrelated for $k \geq 0$.

(c) By squaring both sides of (6.59) and taking expectation, obtain
a recursive equation for $K_X(k, k)$.

(d) Solve for $K_X(k, k)$ in terms of $a$ and $\sigma$. Hint: distinguish between
$a = 1$ and $a \neq 1$.

(e) Is the process $\{X_k;\ k = 1, 2, \ldots\}$ weakly stationary?

(f) Is the process $\{X_k;\ k = 1, 2, \ldots\}$ Gaussian?

(g) For $-1 < a < 1$, show that

$$P(|X_k| > 1) \leq \frac{\sigma^2}{1 - a^2} .$$

25. A distributed system consists of $N$ sensors which view a common random
variable corrupted by different observation noises. In particular,
suppose that the $i$th sensor measures a random variable

$$W_i = X + Y_i , \quad i = 0, 1, 2, \ldots, N-1 ,$$

where the random variables $X, Y_1, \ldots, Y_N$ are all mutually independent
Gaussian random variables with 0 mean. The variance of $X$ is







1 and the variance of $Y_i$ is $r^i$ for a fixed parameter $|r| < 1$. The
observed data are gathered at a central processing unit to form an
estimate of the unknown random variable $X$ as

$$\hat{X}_N = \frac{1}{N} \sum_{i=0}^{N-1} W_i .$$

(a) Find the mean, variance, and probability density function of the
estimate $\hat{X}_N$.

(b) Find the probability density function $f_{\epsilon_N}(\alpha)$ of the error
$\epsilon_N = X - \hat{X}_N$.

(c) Does $\hat{X}_N$ converge in probability to the true value $X$?

26. Suppose that $\{N_t;\ t \geq 0\}$ is a process with independent and
stationary increments and that

$$p_{N_t}(k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} ; \quad k = 0, 1, 2, \ldots .$$

(a) What is the characteristic function for $N_t$?

(b) What is the characteristic function for the increment $N_t - N_s$
for $t > s$?

(c) Suppose that $Y$ is a discrete random variable, independent of
$N_t$, with probability mass function

$$p_Y(k) = (1 - p) p^k , \quad k = 0, 1, \ldots .$$

Find the probability $P(Y = N_t)$.

(d) Suppose that we form the discrete time process $\{X_n;\ n = 1, 2, \ldots\}$
by

$$X_n = N_{2n} - N_{2(n-1)} .$$

What is the covariance of $X_n$?

(e) Find the conditional probability mass function
$p_{X_n|N_{2(n-1)}}(k|m)$.

(f) Find the expectation 

1

27. Does the weak law of large numbers hold for a random process consisting
of Nature selecting a bias uniformly on [0, 1] and then flipping a coin with
that bias forever? In any case, is it true that $S_n$ converges?
If so, to what?



28. Suppose that {X(t)} is a continuous time weakly stationary Gaus- 
sian random process with zero mean and autocorrelation function 
Rx{t) = where a > 0. The signal is passed through an RC 

filter with transfer function

$$H(f) = \frac{\beta}{\beta + j 2\pi f} ,$$

where $\beta = 1/RC$, to form an output process $\{Y(t)\}$.



(a) Find the power spectral densities $S_X(f)$ and $S_Y(f)$.

(b) Evaluate the average powers $E[X^2(t)]$ and $E[Y^2(t)]$.

(c) What is the marginal pdf $f_{Y(t)}$?

(d) Now form a discrete time random process $\{W_n\}$ by $W_n = X(nT)$,
for all integer $n$. This is called sampling with a sampling period
of $T$. Find the mean, autocorrelation function, and, if it exists,
the power spectral density of $\{W_n\}$.

(e) Is $\{Y(t)\}$ a Gaussian random process? Is $\{W_n\}$ a Gaussian random
process? Are they stationary in the strict sense?

(f) Let $\{N_t\}$ be a Poisson counting process. Let $i(t)$ be the
deterministic waveform defined by

$$i(t) = \begin{cases} 1 & \text{if } t \in [0, \delta] \\ 0 & \text{otherwise} \end{cases}$$

that is, a flat pulse of duration $\delta$. For $k = 1, 2, \ldots$, let $t_k$
denote the time of the $k$th jump in the counting process (that
is, $t_k$ is the smallest value of $t$ for which $N_t = k$). Define the
random process $\{Y(t)\}$ by

$$Y(t) = \sum_{k=1}^{N_t} i(t - t_k) .$$

This process is a special case of a class of processes known as
filtered Poisson processes. This particular example is a model
for shot noise in vacuum tubes. Draw some sample waveforms
of this process. Find $M_{Y(t)}(ju)$ and $p_{Y(t)}(n)$.

Hint: You need not consider any properties of the random vari- 
ables {tk} to solve this problem. 







29. In the physically motivated development of the Poisson counting process,
we fixed time values and looked at the random variables giving
the counts and the increments of counts at the fixed times. In this
problem we explore the reverse description: What if we fix the counts
and look at the times at which the process achieves these counts? For
example, for each strictly positive integer $k$, let $t_k$ denote the time
that the $k$th count occurs; that is, $t_k = \alpha$ if and only if

$$N_\alpha = k ; \quad N_t < k ; \quad \text{all } t < \alpha .$$

Define $t_0 = 0$. For each strictly positive integer $k$, define the
interarrival times $\tau_k$ by

$$\tau_k = t_k - t_{k-1} ,$$

and hence

$$t_k = \sum_{i=1}^{k} \tau_i .$$

(a) Find the pdf for $t_k$ for $k = 1, 2, \ldots$.

Hint: First find the cdf by showing that

$$F_{t_k}(\alpha) = \Pr(k\text{th count occurs before or at time } \alpha) = \Pr(N_\alpha \geq k) ,$$

and then using the Poisson pmf to write an expression for this
sum, differentiate to find the pdf. You may have to do some
algebra to reduce the answer to a simple form not involving any
sums. This is most easily done by writing a difference of two
sums in which all terms but one cancel. The final answer is
called the Erlang family of pdf’s. You should find that the pdf
for $t_1$ is an exponential density.

(b) Use the basic properties of the Poisson counting process to prove
that the interarrival times are iid.

Hint: Prove that

$$F_{\tau_n | \tau_1, \ldots, \tau_{n-1}}(\alpha | \beta_1, \ldots, \beta_{n-1}) = F_{\tau_n}(\alpha) = 1 - e^{-\lambda \alpha} ; \quad n = 1, 2, \ldots ; \ \alpha > 0 .$$

Appendix A

Preliminaries: Set Theory, Mappings, Linear Algebra, and Linear Systems



The theory of random processes is constructed on a large number of abstrac- 
tions. These abstractions are necessary to achieve generality with precision 
while keeping the notation used manageably brief. Students will probably 
find learning facilitated if, with each abstraction, they keep in mind (or on 
paper) a concrete picture or example of a special case of the abstraction. 
From this the general situation should rapidly become clear. Concrete ex- 
amples and exercises are introduced throughout the book to help with this 
process. 



A.1 Set Theory

In this section the basic set theoretic ideas that are used throughout the 
book are introduced. The starting point is an abstract space, or simply a 
space, consisting of elements or points, the smallest quantities with which 
we shall deal. This space, often denoted by $\Omega$, is sometimes referred to as
the universal set. To describe a space we may use braces notation with 
either a list or a description contained within the braces { }. Examples 
are: 

[A.0] The abstract space consisting of no points at all, that is, an empty






(or trivial) space. This possibility is usually excluded by assuming 
explicitly or implicitly that the abstract space is nonempty, that is, 
to contain at least one point. 

[A.1] The abstract space with only the two elements zero and one to denote
the possible receptions of a radio receiver of binary data at one
particular signaling time instant. Equivalently, we could give different 
names to the elements and have a space {0, 1}, the binary numbers, 
or a space with the elements heads and tails. Clearly the structure 
of all of these spaces is the same; only the names have been changed. 
They are different, however, in that one is numeric, and hence we 
can perform arithmetic operations on the outcomes, while the other 
is not. Spaces which do not have numeric points (or points labeled by 
numeric vectors, sequences, or waveforms) are sometimes referred to 
as categorical. Notationally we describe these spaces as {zero, one}, 
{0,1}, and {heads, tails}, respectively. 

[A.2] Given a fixed positive integer $k$, the abstract space consisting of all
possible binary $k$-tuples, that is, all $2^k$ $k$-dimensional binary vectors.
This space could model the possible sequences of $k$ flips of the same
coin or a single flip of $k$ coins. Note the example [A.1] is a special
case of example [A.2].

[A.3] The abstract space with elements consisting of all infinite sequences
of ones and zeros or 1's and 0's denoting the sequence of possible
receptions of a radio receiver of binary data over all signaling times.
The sequences could be one-sided in the sense of beginning at time
zero and continuing forever, or they could be two-sided in the sense
of beginning in the infinitely remote past (time $-\infty$) and continuing
into the infinitely remote future.

[A.4] The abstract space consisting of all ASCII (American Standard Code 
for Information Interchange) codes for characters (letters, numerals, 
and control characters such as line feed, rub out, etc.). These might 
be in decimal, hexadecimal, or binary form. In general, we can consider
this space as just a space $\{a_i,\ i = 1, \ldots, N\}$ containing a finite
number of elements (which here might well be called symbols, letters, 
or characters). 

[A.5] Given a fixed positive integer $k$, the space of all $k$-dimensional vectors
with coordinates in the space of example [A.4]. This could model
all possible contents of an ASCII buffer used to drive a serial printer. 

[A.6] The abstract space of all infinite (single-sided or double-sided) se- 
quences of ASCII character codes. 







[A.7] The abstract space with elements consisting of all possible voltages 
measured at the output of a radio receiver at one instant of time. 
Since all physical equipment has limits to the values of voltage (called 
“dynamic range”) that it can support, one model for this space is a 
subset of the real line such as the closed interval $[-V, V] = \{r : -V \leq r \leq V\}$,
i.e., the set of all real numbers $r$ such that $-V \leq r \leq +V$. If,
however, the dynamic range is not precisely known or if we wish to
use a single space as a model for several measurements with different
dynamic ranges, then we might wish to use the entire real line
$\Re = (-\infty, \infty) = \{r : -\infty < r < \infty\}$. The fact that the space includes
“impossible” as well as “possible” values is acceptable in a model. 

[A.8] Given a positive integer $k$, the abstract space of all $k$-dimensional
vectors with coordinates in the space of example [A.7]. If the real
line is chosen as the coordinate space, then this is $k$-dimensional
Euclidean space.

[A.9] The abstract space with elements being all infinite sequences of members
of the space of example [A.7], e.g., all single-sided real-valued
sequences of the form $\{x_n,\ n = 0, 1, 2, \ldots\}$, where $x_n \in \Re$ for all
$n = 0, 1, 2, \ldots$

[A.10] Instead of constructing a new space as sequences of elements from
another space, we might wish to consider a new space consisting of
all waveforms whose instantaneous values are elements in another
space, e.g., the space of all waveforms $\{x(t);\ t \in (-\infty, \infty)\}$, where
$x(t) \in \Re$, all $t$. This would model, for instance, the space of all possible
voltage-time waveforms at the output of a radio receiver. Examples
of members of this space are $x(t) = \cos \omega t$, $x(t) = e^{st}$, $x(t) = 1$,
$x(t) = t$, and so on. As with sequences, the waveforms may begin in
the remote past or they might be defined for $t$ running from 0 to $\infty$.

The preceding examples focus on three related themes that will be considered
throughout the book: Examples [A.1], [A.4], and [A.7] present models
for the possible values of a single measurement. The mathematical model
for such a measurement with an unknown outcome is called a random vari- 
able. Such simple spaces describe the possible values that a random variable 
can assume. Examples [A.2], [A.5], and [A.8] treat vectors (finite collections
or finite sequences) of individual measurements. The mathematical model 
for such a vector-valued measurement is called a random vector. Since a 
vector is made up of a finite collection of scalars, we can also view this 
random object as a collection (or family) of random variables. These two 
viewpoints, a single random vector-valued measurement and a collection
of random scalar-valued measurements, will both prove useful. Examples
[A.3], [A.6], and [A.9] consider infinite sequences of values drawn from
a common alphabet and hence the possible values of an infinite sequence 
of individual measurements. The mathematical model for this is called a 
random process (or a random sequence or a random time series). Example 
[A.10] considers a waveform taking values in a given coordinate space. The
mathematical model for this is also called a random process. When it is 
desired to distinguish between random sequences and random waveforms, 
the first is called a discrete time random process and the second is called a 
continuous time random process. 

In chapter 3 we shall define precisely what is meant by a random vari- 
able, a random vector, and a random process. For now, random variables, 
random vectors, and random processes can be viewed simply as abstract 
spaces such as in the preceding examples for scalars, vectors, and sequences 
or waveforms together with a probabilistic description of the possible out- 
comes, that is, a means of quantifying how likely certain outcomes are. It 
is a crucial observation at this point that the three notions are intimately 
connected: random vectors and processes can be viewed as collections or 
families of random variables. Conversely, we can obtain the scalar ran- 
dom variables by observing the coordinates of a random vector or random 
process. That is, if we “sample” a random process once, we get a ran- 
dom variable. Thus we shall often be interested in several different, but 
related, abstract spaces. For example, the individual scalar outputs may 
be drawn from one space, say $A$, which could be any of the spaces in examples
[A.1], [A.4], or [A.7]. We then may also wish to look at all possible
$k$-dimensional vectors with coordinates in $A$, a space that is often denoted
by $A^k$, or at spaces of infinite sequences or waveforms of $A$. These latter
spaces are called product spaces and will play an important role in modeling 
random phenomena. 
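
A finite product space can be enumerated directly. The sketch below is only an illustration under the assumption that the coordinate space is the binary space of example [A.1] and that k = 3; the names `A`, `A_k`, and `first_coordinates` are illustrative.

```python
from itertools import product

A = (0, 1)  # coordinate space, here the binary space of example [A.1]
k = 3

# The product space A^k: every k-dimensional vector with coordinates in A.
A_k = list(product(A, repeat=k))

# Observing one coordinate of the vectors recovers the scalar space A.
first_coordinates = {vector[0] for vector in A_k}

print(len(A_k))           # number of points in the product space
print(first_coordinates)  # the values taken by a single coordinate
```

The same construction with a larger alphabet gives the buffer space of example [A.5]; infinite sequences and waveforms cannot be enumerated this way and are handled abstractly.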

Usually one will have the option of choosing any of a number of spaces 
as a model for the outputs of a given random variable. For example, in 
flipping a coin one could use the binary space {head, tail}, the binary space 
{0, 1} (obtained by assigning 0 to head and 1 to tail), or the entire real line 
$\Re$. Obviously the last space is much larger than needed, but it still captures
all of the possible outcomes (along with many “impossible” ones). Which 
view and which abstract space is the “best” will depend on the problem at 
hand, and the choice will usually be made for reasons of convenience. 

Given an abstract space, we shall consider groupings or collections of 
the elements that may be (but are not necessarily) smaller than the whole 
space and larger than single points. Such groupings are called sets. If every 
point in one set is also a point in a second set, then the first set is said 
to be a subset of the second. Examples (corresponding respectively to the
previous abstract space examples) are:

[A.11] The empty set $\emptyset$ consisting of no points at all. Thus we could
rewrite example [A.0] as $\Omega = \emptyset$. By convention, the empty set is
considered to be a subset of all other sets.

[A.12] The set consisting of the single element one. This is an example of 
a one-point set or singleton set. 

[A.13] The set of all $k$-dimensional binary vectors with exactly one zero
coordinate. 

[A.14] The set of all infinite sequences of ones and zeros with exactly 50% 
of the symbols being one (as defined by an appropriate mathematical 
limit). 

[A.15] The set of all ASCII characters for capital letters. 

[A.16] The set of all four-letter English words. 

[A.17] The set of all infinite sequences of ASCII characters excluding those 
representing control characters. 

[A.18] Intervals such as the set of all voltages lying between 1 volt and 20 
volts are useful subsets of the real line. These come in several forms, 
depending on whether or not the end points are included. Given
$b > a$, define the “open” interval $(a,b) = \{r : a < r < b\}$, and given
$b \geq a$, define the “closed” interval $[a,b] = \{r : a \leq r \leq b\}$. That is,
we use a bracket if the end point is included and a parenthesis if it
is not. We will also consider “half-open” or “half-closed” intervals of
the form $(a,b] = \{r : a < r \leq b\}$ and $[a,b) = \{r : a \leq r < b\}$. (We
use quotation marks around terms like open and closed because we
are not rigorously defining them; we are implicitly defining them by
their most important examples, intervals of the real line.)

[A.19] The set of all vectors of k voltages such that the largest value is 
less than 1 volt. 

[A.20] The set of all sequences of voltages which are all nonnegative. 

[A.21] The set of all voltage-time waveforms that lie between 1 and 20 
volts for all time. 
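
Two of the subsets above are small enough to write out in code. This is a minimal sketch, assuming k = 4 for example [A.13] and using a hypothetical helper `in_interval` for the interval conventions of example [A.18]; neither name comes from the text.

```python
from itertools import product

k = 4
space = set(product((0, 1), repeat=k))            # the space of example [A.2]
one_zero = {v for v in space if v.count(0) == 1}  # the subset of example [A.13]

def in_interval(r, a, b, closed_left=True, closed_right=True):
    """Membership test for an interval with end points a and b."""
    left_ok = (a <= r) if closed_left else (a < r)
    right_ok = (r <= b) if closed_right else (r < b)
    return left_ok and right_ok

print(sorted(one_zero))
print(in_interval(1.0, 1.0, 20.0))                     # closed interval [1, 20]
print(in_interval(1.0, 1.0, 20.0, closed_left=False))  # half-open interval (1, 20]
```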



Given a set $F$ of points in an abstract space $\Omega$, we shall write $\omega \in F$
for “the point $\omega$ is contained in the set $F$” and $\omega \notin F$ for “the point $\omega$
is not contained in the set $F$.” The symbol $\in$ is referred to as the element
inclusion symbol. We shall often describe a set using this notation in the
form $F = \{\omega : \omega \text{ has some property}\}$. Thus $F = \{\omega : \omega \in F\}$. For example,
a set in the abstract space $\Omega = \{\omega : -\infty < \omega < \infty\}$ (the real line $\Re$) is
$\{\omega : -2 < \omega < 4.6\}$. The abstract space itself is a grouping of elements and
hence is also called a set. Thus $\Omega = \{\omega : \omega \in \Omega\}$.

If a set $F$ is a subset of another set $G$, that is, if $\omega \in F$ implies that
also $\omega \in G$, then we write $F \subset G$. The symbol $\subset$ is called the set inclusion
symbol. Since a set is included within itself, every set is a subset of itself.

An individual element or point $\omega_0$ in $\Omega$ can be considered both as a
point or element in the space and as a one-point set or singleton set
$\{\omega_0\} = \{\omega : \omega = \omega_0\}$. Note, however, that the braces notation is more
precise when we are considering the one-point set and that $\omega_0 \in \Omega$ while
$\{\omega_0\} \subset \Omega$.
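
The distinction between element inclusion and set inclusion carries over to most programming languages. In the sketch below, which assumes the coin-flip space of example [A.1], Python's `in` plays the role of the element inclusion symbol and `<=` plays the role of the set inclusion symbol; the variable names are illustrative.

```python
omega = {"heads", "tails"}   # the abstract space
F = {"heads"}                # a subset of omega
w0 = "heads"                 # a single point of omega

point_in_space = w0 in omega          # element inclusion: a point belongs to a set
singleton_in_space = {w0} <= omega    # set inclusion: the one-point set is a subset
subset_of_itself = F <= F             # every set is a subset of itself

print(point_in_space, singleton_in_space, subset_of_itself)
```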

The three basic operations on sets are complementation, intersection, 
and union. The definitions are given next. Refer also to Figure A.1 as an
aid in visualizing the definitions. In Figure A.1, $\Omega$ is pictured as the outside
box and the sets $F$ and $G$ are pictured as arbitrary blobs within the box.
Such diagrams are called Venn diagrams. 

Given a set $F$, the complement of $F$ is denoted by $F^c$, which is defined
by

$$F^c = \{\omega : \omega \notin F\} ,$$

that is, the complement of $F$ contains all of the points of $\Omega$ that are not in
$F$.

Given two sets $F$ and $G$, the intersection of $F$ and $G$ is denoted by
$F \cap G$, which is defined by

$$F \cap G = \{\omega : \omega \in F \text{ and } \omega \in G\} ,$$

that is, the intersection of two sets $F$ and $G$ contains the points which are
in both sets.

If $F$ and $G$ have no points in common, then $F \cap G = \emptyset$, the null set,
and $F$ and $G$ are said to be disjoint or mutually exclusive.

Given two sets $F$ and $G$, the union of $F$ and $G$ is denoted by $F \cup G$,
which is defined by

$$F \cup G = \{\omega : \omega \in F \text{ or } \omega \in G\} ,$$

that is, the union of two sets $F$ and $G$ contains the points that are either
in one set or the other, or both.
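
The three basic operations can be tried on a small finite space. The sketch below assumes a hypothetical universal set of the integers 0 through 9; because a complement only makes sense relative to that space, the helper `complement` (an illustrative name) takes the universe as an argument.

```python
omega = set(range(10))   # hypothetical finite universal set
F = {0, 1, 2, 3, 4}
G = {3, 4, 5, 6}

def complement(S, universe=omega):
    """All points of the universe that are not in S."""
    return universe - S

print(complement(F))   # complement of F
print(F & G)           # intersection of F and G
print(F | G)           # union of F and G
```

With these sets the intersection is contained in each of F and G, and each of them is contained in the union, while the union is contained in neither.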

Observe that the intersection of two sets is always a subset of each of
them, e.g., $F \cap G \subset F$. The union of two sets, however, is not a subset of
either of them (unless one set is a subset of the other). Both of the original
sets are subsets of their union, e.g., $F \subset F \cup G$.

In addition to the three basic set operations, there are two others that
will come in handy. Both can be defined in terms of the three basic operations.
Refer to Figure A.2 as a visual aid in understanding the definitions.



Figure A.2: Set Difference Operations. (a) $F - G$; (b) $F \Delta G$.

Given two sets $F$ and $G$, the set difference of $F$ and $G$ is denoted by
$F - G$, which is defined as

$$F - G = \{\omega : \omega \in F \text{ and } \omega \notin G\} = F \cap G^c ;$$

that is, the difference of F and G contains all of the points in F that are 
not also in G. Note that this operation is not completely analogous to the 
“minus” of ordinary arithmetic because there is no such thing as a “negative 
set.” 

Given two sets $F$ and $G$, their symmetric difference is denoted by $F \Delta G$,
which is defined as

$$\begin{aligned} F \Delta G &= \{\omega : \omega \in F \text{ or } \omega \in G \text{ but not both}\} \\ &= (F - G) \cup (G - F) = (F \cap G^c) \cup (F^c \cap G) \\ &= (F \cup G) - (F \cap G) ; \end{aligned}$$

that is, the symmetric difference between two sets is the set of points that
are in one of the two sets but are not common to both sets. If both sets
are the same, the symmetric difference consists of no points, that is, it is
the empty set. If $F \subset G$, then obviously $F \Delta G = G - F$.
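
The equivalent forms of the two difference operations can be checked on a small example. This is a sketch under the assumption of a hypothetical eight-point universal set; Python's `-` and `^` operators on sets compute the set difference and the symmetric difference, and the set `H` is introduced only to illustrate the subset case.

```python
omega = set(range(8))    # hypothetical finite universal set
F = {0, 1, 2, 3}
G = {2, 3, 4, 5}
H = {0, 1}               # a subset of F, used for the last identity

difference = F - G       # points of F that are not in G
symmetric = F ^ G        # points in exactly one of F and G

print(difference, symmetric)
print(difference == F & (omega - G))          # F - G equals F intersected with G^c
print(symmetric == (F - G) | (G - F))         # first form of the symmetric difference
print(symmetric == (F | G) - (F & G))         # second form of the symmetric difference
print(H ^ F == F - H)                         # case where H is a subset of F
```

A check on one pair of sets is of course not a proof; it only confirms that the identities are being read correctly.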

Observe that two sets $F$ and $G$ will be equal if and only if $F \subset G$ and
$G \subset F$. This observation is often useful as a means of proving that two
sets are identical: first prove that each point in one set is in the other and
hence the first set is a subset of the second. Then prove the opposite inclusion.
Surprisingly, this technique is frequently much easier than a direct proof 
that two sets are identical by a pointwise argument of commonality. 

We will often wish to combine sets in a series of operations and to reduce
the expression for the resulting set to its simplest and most compact form.
Although the most compact form frequently can be seen quickly with the
aid of a Venn diagram, as in Figures A.1 and A.2, to be completely rigorous,
the use of set theory or set algebra to manipulate the basic operations
is required. Table A.1 collects the most important such identities. The
first seven relations can be taken as axioms in an algebra of sets and used
to derive all other relations, including the remaining relations in the table.
Some examples of such derivations follow the table. Readers who are familiar
with Boolean algebra will find a one-to-one analogy between the algebra
of sets and Boolean algebra.

DeMorgan’s “laws” (A.6) and (A.10) are useful when complementing
unions of intersections. Relation (A.16) is useful for writing the union of
overlapping sets as a union of disjoint sets. A set and its complement are
always disjoint by relation (A.5).



A.2 Examples of Proofs

Relation (A.8). From the definition of intersection and Figure A.1 we verify
the truth of (A.8). Algebraically, we show the same thing from the basic
seven axioms: From (A.4) and (A.6) we have that

$$A \cap B = ((A \cap B)^c)^c = (A^c \cup B^c)^c ,$$

and using (A.1), (A.4), and (A.6), this becomes

$$(B^c \cup A^c)^c = ((B \cap A)^c)^c = B \cap A$$

as desired.

Relation (A.18). Set $F = \Omega$ in (A.5) to obtain $\Omega \cap \Omega^c = \emptyset$,
which with (A.7) and (A.8) yields (A.18).

Relation (A.11). Complement (A.5), $(F^c \cap F)^c = \emptyset^c$, and hence, using
(A.6), $(F^c \cup F) = \emptyset^c$, and finally, using (A.4) and (A.18),
$F^c \cup F = \Omega$.

Relation (A.12). Using $F^c$ in (A.7): $F^c \cap \Omega = F^c$. Complementing the
result: $(F^c \cap \Omega)^c = (F^c)^c = F$ (by (A.4)). Using (A.6):
$(F^c \cap \Omega)^c = F \cup \Omega^c = F$. From (A.18) $\Omega^c = \emptyset$,
yielding (A.12).

$F \cup G = G \cup F$                               commutative law           (A.1)
$F \cup (G \cup H) = (F \cup G) \cup H$             associative law           (A.2)
$F \cap (G \cup H) = (F \cap G) \cup (F \cap H)$    distributive law          (A.3)
$(F^c)^c = F$                                                                 (A.4)
$F \cap F^c = \emptyset$                                                      (A.5)
$(F \cap G)^c = F^c \cup G^c$                       DeMorgan’s “law”          (A.6)
$F \cap \Omega = F$                                                           (A.7)

$F \cap G = G \cap F$                               commutative law           (A.8)
$F \cap (G \cap H) = (F \cap G) \cap H$             associative law           (A.9)
$(F \cup G)^c = F^c \cap G^c$                       DeMorgan’s other “law”    (A.10)
$F \cup F^c = \Omega$                                                         (A.11)
$F \cup \emptyset = F$                                                        (A.12)
$F \cup (F \cap G) = F = F \cap (F \cup G)$                                   (A.13)
$F \cup \Omega = \Omega$                                                      (A.14)
$F \cap \emptyset = \emptyset$                                                (A.15)
$F \cup G = F \cup (F^c \cap G) = F \cup (G - F)$                             (A.16)
$F \cup (G \cap H) = (F \cup G) \cap (F \cup H)$    distributive law          (A.17)
$\Omega^c = \emptyset$                                                        (A.18)
$F \cup F = F$                                                                (A.19)
$F \cap F = F$                                                                (A.20)

Table A.1: Set Algebra

Relation (A.20). Set $G = F$ and $H = F^c$ in (A.3) to obtain
$F \cap (F \cup F^c) = (F \cap F) \cup (F \cap F^c) = F \cap F$ using (A.5) and
(A.12). Applying (A.11) and (A.7) to the left-hand side of this relation
yields $F \cap \Omega = F = F \cap F$.

Relation (A.19). Complement (A.20) using (A.6) and replace $F^c$ by $F$.
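
The algebraic derivations can be complemented by an exhaustive numerical check of relations (A.1) through (A.20). The sketch below is not a proof; it assumes a three-point universal space `OMEGA` (an arbitrary choice) and tests every triple of its subsets, so it only rules out errors that would already show up in such a small space. The helper names `subsets`, `c`, and `check_table` are illustrative.

```python
from itertools import combinations, product

OMEGA = frozenset({0, 1, 2})  # small universal space, chosen only for illustration
EMPTY = frozenset()


def subsets(universe):
    """Return all subsets of the universe as frozensets."""
    items = sorted(universe)
    return [frozenset(combo)
            for size in range(len(items) + 1)
            for combo in combinations(items, size)]


def c(S):
    """Complement relative to OMEGA."""
    return OMEGA - S


def check_table():
    """Return True if (A.1)-(A.20) hold for every triple of subsets of OMEGA."""
    for F, G, H in product(subsets(OMEGA), repeat=3):
        ok = (
            F | G == G | F                              # (A.1)
            and F | (G | H) == (F | G) | H              # (A.2)
            and F & (G | H) == (F & G) | (F & H)        # (A.3)
            and c(c(F)) == F                            # (A.4)
            and F & c(F) == EMPTY                       # (A.5)
            and c(F & G) == c(F) | c(G)                 # (A.6)
            and F & OMEGA == F                          # (A.7)
            and F & G == G & F                          # (A.8)
            and F & (G & H) == (F & G) & H              # (A.9)
            and c(F | G) == c(F) & c(G)                 # (A.10)
            and F | c(F) == OMEGA                       # (A.11)
            and F | EMPTY == F                          # (A.12)
            and F | (F & G) == F                        # (A.13), first equality
            and F & (F | G) == F                        # (A.13), second equality
            and F | OMEGA == OMEGA                      # (A.14)
            and F & EMPTY == EMPTY                      # (A.15)
            and F | G == F | (c(F) & G)                 # (A.16), first equality
            and F | G == F | (G - F)                    # (A.16), second equality
            and F | (G & H) == (F | G) & (F | H)        # (A.17)
            and c(OMEGA) == EMPTY                       # (A.18)
            and F | F == F                              # (A.19)
            and F & F == F                              # (A.20)
        )
        if not ok:
            return False
    return True


print(check_table())
```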



The proofs for the examples were algebraic in nature, manipulating the
operations based on the axioms. Proofs can also be constructed based
on the definitions of the basic operations. For example, DeMorgan’s law
can be proved directly by considering individual points. To prove that
$(F \cap G)^c = F^c \cup G^c$ it suffices to show separately that
$(F \cap G)^c \subset F^c \cup G^c$ and $F^c \cup G^c \subset (F \cap G)^c$.
Suppose that $\omega \in (F \cap G)^c$, then $\omega \notin F \cap G$ from
the definition of complement and hence $\omega \notin F$ or $\omega \notin G$
from the definition of intersection (if $\omega$ were in both, it would be in
the intersection). Thus either $\omega \in F^c$ or $\omega \in G^c$ and hence
$\omega \in F^c \cup G^c$. Conversely, if $\omega \in F^c \cup G^c$, then
$\omega \in F^c$ or $\omega \in G^c$, and hence either $\omega \notin F$ or
$\omega \notin G$, which implies that $\omega \notin F \cap G$, which in turn
implies that $\omega \in (F \cap G)^c$, completing the proof.

We will have occasion to deal with more general unions and intersections,
that is, unions or intersections of more than two or three sets. As
long as the number of unions and intersections is finite, the generalizations
are obvious. The various set theoretic relations extend to unions and
intersections of finite collections of sets. For example, DeMorgan’s law for
finite collections of sets is

$$\left( \bigcap_{i=1}^{n} F_i \right)^c = \bigcup_{i=1}^{n} F_i^c .$$

This result can be proved using the axioms or by induction. Point
arguments are often more direct. Define the set on the left hand side of the
equation as $G$ and that on the right hand side as $H$, and prove $G = H$
by considering individual points. This is done by separately showing that
$G \subset H$ and $H \subset G$, which implies the two sets are the same. To
show that $G \subset H$, let $\omega \in G = (\bigcap_{i=1}^{n} F_i)^c$, which
means that $\omega \notin \bigcap_{i=1}^{n} F_i$, which means that
$\omega \notin F_i$ for some $i$ or, equivalently, that $\omega \in F_i^c$ for
some $i$. This means that $\omega \in \bigcup_{i=1}^{n} F_i^c$ and hence that
$\omega \in H$. Thus $G \subset H$ since we have shown that every point in $G$
is also in $H$. The converse containment follows in a similar manner. If
$\omega \in H = \bigcup_{i=1}^{n} F_i^c$, then $\omega \in F_i^c$ for some $i$
and hence $\omega \notin F_i$ for some $i$. This implies that
$\omega \notin \bigcap_{i=1}^{n} F_i$ and hence that $\omega \in G$,
completing the proof.
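
The finite version of DeMorgan's law can also be checked on a concrete family. The sketch below assumes an arbitrary universal space `OMEGA` and an arbitrary three-set `family`; the helper names are illustrative. It builds the two sides $G$ and $H$ of the identity separately and compares them.

```python
from functools import reduce

OMEGA = set(range(12))  # assumed universal space
family = [set(range(0, 8)), set(range(4, 12)), {1, 5, 6, 9}]  # arbitrary F_1, F_2, F_3


def complement(S):
    """Complement relative to OMEGA."""
    return OMEGA - S


def big_union(sets):
    """Union of a finite collection of sets (empty collection gives the empty set)."""
    return reduce(set.union, sets, set())


def big_intersection(sets):
    """Intersection of a finite collection of sets (empty collection gives OMEGA)."""
    return reduce(set.intersection, sets, set(OMEGA))


# Left-hand side: complement of the intersection.
G = complement(big_intersection(family))
# Right-hand side: union of the complements.
H = big_union(complement(F) for F in family)
assert G == H

# The companion law: complement of the union equals intersection of complements.
assert complement(big_union(family)) == big_intersection(
    [complement(F) for F in family])

print(sorted(big_intersection(family)))
```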

The operations can also be defined for quite general infinite collections
of sets as well. Say that we have an indexed collection of sets
$\{A_i; i \in \mathcal{I}\}$, sometimes denoted $\{A_i\}_{i \in \mathcal{I}}$,
for some index set $\mathcal{I}$. In other words, this collection is a set
whose elements are sets, with one set $A_i$ for each possible value of an
index $i$ drawn from $\mathcal{I}$. We call such a collection a family or
class of sets. (To avoid confusion we never say a “set of sets.”) The index
set $\mathcal{I}$ can be thought of as numbering the sets. Typical index sets
are the set $\mathcal{Z}_+$ of all nonnegative integers,
$\mathcal{Z} = \{\ldots, -1, 0, 1, \ldots\}$, or the real line $\Re$. The
index set may be finite in that it has only a finite number of entries, say
$\mathcal{I} = \mathcal{Z}_k = \{0, 1, \ldots, k-1\}$. The index set is said
to be countably infinite if its elements can be counted, that is, can be put
into a one-to-one correspondence with a subset of the nonnegative integers
$\mathcal{Z}_+$; e.g., $\mathcal{Z}_+$ or $\mathcal{Z}$ itself. If an index
set has an infinity of elements, but the elements cannot be counted, then it
is said to be uncountably infinite, for example $\Re$ or the unit interval
$[0, 1]$ (see Problem 11).

The family of sets is said to be finite, countable, or uncountable if the
respective index set is finite, countable, or uncountable. As an example,
the family of sets $\{[0, 1/r); r \in \mathcal{I}\}$ is countable if
$\mathcal{I} = \mathcal{Z}$ and uncountable if $\mathcal{I} = \Re$. Another
way of describing countably infinite sets is that they can be put into
one-to-one correspondence with the integers. For example, the set of rational
numbers is countable because it can be enumerated; the set of irrational
numbers is not.

The obvious extensions of the pairwise definitions of union and intersection
will now be given. Given an indexed family of sets
$\{A_i; i \in \mathcal{I}\}$, define the union by

$$\bigcup_{i \in \mathcal{I}} A_i = \{\omega : \omega \in A_i \text{ for at least one } i \in \mathcal{I}\}$$

and define the intersection by

$$\bigcap_{i \in \mathcal{I}} A_i = \{\omega : \omega \in A_i \text{ for all } i \in \mathcal{I}\} .$$

In certain special cases we shall make the notation more specific for
particular index sets. For example, if $\mathcal{I} = \{0, \ldots, n-1\}$,
then we write the union and intersection as

$$\bigcup_{i=0}^{n-1} A_i \quad \text{and} \quad \bigcap_{i=0}^{n-1} A_i$$

respectively.

A collection of sets $\{F_i; i \in \mathcal{I}\}$ is said to be disjoint or
pairwise disjoint or mutually exclusive if

$$F_i \cap F_j = \emptyset ; \quad \text{all } i, j \in \mathcal{I}, \ i \neq j ,$$

that is, if no set in the collection contains points contained by other sets
in the collection.

The class of sets is said to be collectively exhaustive or to exhaust the
space if

$$\bigcup_{i \in \mathcal{I}} F_i = \Omega ,$$

that is, together the $F_i$ contain all the points of the space.

A collection of sets $\{F_i; i \in \mathcal{I}\}$ is called a partition of
the space $\Omega$ if the collection is both disjoint and collectively
exhaustive. A collection of sets $\{F_i; i \in \mathcal{I}\}$ is said to
partition a set $G$ if the collection is disjoint and the union of all of its
members is identical to $G$.
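
For finite families the two defining conditions of a partition can be tested directly. The following is a minimal sketch under the assumption that the family and the space are finite Python sets; the function names `is_disjoint` and `is_partition` and the example sets are illustrative only.

```python
from itertools import combinations


def is_disjoint(family):
    """True if every pair of distinct members of the family has empty intersection."""
    return all(A.isdisjoint(B) for A, B in combinations(family, 2))


def is_partition(family, space):
    """True if the family is pairwise disjoint and its union equals the space."""
    union = set().union(*family)
    return is_disjoint(family) and union == set(space)


OMEGA = set(range(6))  # assumed finite space

good = [{0, 1}, {2, 3, 4}, {5}]        # disjoint and collectively exhaustive
overlapping = [{0, 1}, {1, 2, 3, 4, 5}]  # exhaustive but not disjoint
incomplete = [{0, 1}, {3}]             # disjoint but not exhaustive

print(is_partition(good, OMEGA),
      is_partition(overlapping, OMEGA),
      is_partition(incomplete, OMEGA))
```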



A.3 Mappings and Functions

We shall make much use of mappings or functions from one space to another.
This is of importance in a number of applications. For example, the
waveforms and sequences that we considered as members of an abstract space
describing the outputs of a random process are just functions of time, e.g.,
for each value of time $t$ in some continuous or discrete collection of
possible times we assigned some output value to the function. As a more
complicated example, consider a binary digit that is transmitted to a
receiver at some destination by sending either plus or minus $V$ volts
through a noisy environment called a “channel.” At the receiver a decision is
made whether $+V$ or $-V$ was sent. The receiver puts out a 1 or a 0,
depending on the decision. In this example three mappings are involved: The
transmitter maps a binary symbol in $\{0, 1\}$ into either $+V$ or $-V$.
During transmission, the channel has an input either $+V$ or $-V$ and
produces a real number, not usually equal to 0, 1, $+V$, or $-V$. At the
receiver, a real number is viewed and a binary number produced.

We will encounter a variety of functions or mappings, from simple arithmetic
operations to general filtering operations. We now introduce some
common terminology and notation for handling such functions. Given two
abstract spaces $\Omega$ and $A$, an $A$-valued function or mapping $f$ or,
in more detail, $f : \Omega \to A$, is an assignment of a unique point in $A$
to each point in $\Omega$; that is, given any point $\omega \in \Omega$,
$f(\omega)$ is some value in $A$. $\Omega$ is called the domain or domain of
definition of the function $f$, and $A$ is called the range of $f$. Given any
sets $F \subset \Omega$ and $G \subset A$, define the image of $F$ (under
$f$) as the set

$$f(F) = \{a : a = f(\omega) \text{ for some } \omega \in F\}$$

and the inverse image (also called the preimage) of $G$ (under $f$) as the set

$$f^{-1}(G) = \{\omega : f(\omega) \in G\} .$$

Thus $f(F)$ is the set of all points in $A$ obtained by mapping points in $F$,
and $f^{-1}(G)$ is the set of all points in $\Omega$ that map into $G$.

For example, let $\Omega = [-1, 1]$ and $A = [-10, 10]$. Given the function
$f(\omega) = \omega^2$ with domain $\Omega$ and range $A$, define the sets
$F = (-1/2, 1/2) \subset \Omega$ and $G = (-1/4, 1) \subset A$. Then
$f(F) = [0, 1/4)$ and $f^{-1}(G) = (-1, 1)$, since the endpoints
$\omega = \pm 1$ map to 1, which is not in $G$. As you can see from this
example, not all points in $G$ have to correspond to points in $F$. In fact,
the inverse image can be empty; e.g., continuing the same example,
$f^{-1}((-1/4, 0)) = \emptyset$.
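
The image and inverse image can be explored numerically by replacing the interval $[-1, 1]$ with a finite grid. This is only a sketch: the grid spacing of 1/4, the helper names `image` and `preimage`, and the representation of the set $G$ by a membership test are all assumptions made for illustration, and a grid cannot capture every property of the continuous example.

```python
from fractions import Fraction

# Finite grid standing in for the domain [-1, 1] (spacing 1/4 is arbitrary).
domain = [Fraction(k, 4) for k in range(-4, 5)]


def f(w):
    """The example function f(w) = w^2."""
    return w * w


def image(func, F):
    """Set of values func(w) for w in F."""
    return {func(w) for w in F}


def preimage(func, in_G, dom):
    """Points w of the domain whose value func(w) satisfies the membership test in_G."""
    return [w for w in dom if in_G(func(w))]


# Grid points lying in F = (-1/2, 1/2).
F = [w for w in domain if -Fraction(1, 2) < w < Fraction(1, 2)]


def in_G(a):
    """Membership test for G = (-1/4, 1)."""
    return -Fraction(1, 4) < a < 1


def in_negative_part(a):
    """Membership test for the interval (-1/4, 0)."""
    return -Fraction(1, 4) < a < 0


print(sorted(image(f, F)))
print(preimage(f, in_G, domain))
print(preimage(f, in_negative_part, domain))
```

On this grid the inverse image of $G$ contains every grid point except $\pm 1$, and the inverse image of $(-1/4, 0)$ is empty because a square is never negative.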

The image of the entire space $\Omega$ is called the range space of $f$, and
it need not equal the range; e.g., the function $f$ could map the whole input
space into a single point in $A$. For example, $f : \Re \to \Re$ defined by
$f(r) = 1$, all $r$, has a range space of a single point. If the range space
equals the range, the mapping is said to be onto. (Is the mapping $f$ of the
preceding example onto? What is the range space? Is the range unique?)

A mapping $f$ is called one-to-one if $x \neq y$ implies that
$f(x) \neq f(y)$.



A.4 Linear Algebra

We collect a few definitions and results for vectors, matrices, and
determinants.

There is a variety of notations for vectors. Historically a common form was
to use boldface, as in $\mathbf{x} = (x_0, x_1, \ldots, x_{k-1})$, to denote
a $k$-dimensional vector with $k$ components $x_i$, $i = 0, 1, \ldots, k-1$.
When dealing with linear algebra, however, it is most commonly the convention
to assume that vectors are column vectors, e.g.,

$$\mathbf{x} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{k-1} \end{pmatrix} ,$$

or, as an in-line equation, $\mathbf{x} = (x_0, x_1, \ldots, x_{k-1})^t$,
where $t$ denotes “transpose.” We will often be lazy and write vectors inline
without explicitly denoting the transpose unless it is needed, e.g., in
vector/matrix equations. Although boldface makes it clear which symbols are
vectors and which are scalars, in modern practice it is more common to drop
the distinction and not use boldface, i.e., to write a vector as simply
$x = (x_0, x_1, \ldots, x_{k-1})$ or, if it is desired to make clear it is a
column vector, as $x = (x_0, x_1, \ldots, x_{k-1})^t$. Both boldface and
non-boldface notations are used in this book. Generally, early on the
boldface notation is used to clarify when vectors or scalars are being used
while later in the book boldface is often dropped.

The inner product (or dot product) of two real-valued $n$-dimensional
vectors $x$ and $y$ is defined by the scalar value

$$x^t y = \sum_{i=0}^{n-1} x_i y_i .$$  (A.22)

If the vectors are more generally complex valued, then the transpose is
replaced by a conjugate transpose

$$x^* y = \sum_{i=0}^{n-1} x_i^* y_i .$$

A matrix is a rectangular array of numbers

$$A = \begin{pmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & \cdots & a_{0,n-1} \\
a_{1,0} & a_{1,1} & a_{1,2} & \cdots & a_{1,n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
a_{m-1,0} & a_{m-1,1} & a_{m-1,2} & \cdots & a_{m-1,n-1}
\end{pmatrix}$$  (A.23)

with $m$ rows and $n$ columns. Boldface notation is also used for matrices.
If $m = n$ the matrix is said to be square. A matrix is symmetric if
$A^t = A$, where $A^t$ is the transpose of the matrix $A$, that is, the
$n \times m$ matrix whose $k,j$th element is $(A^t)_{k,j} = a_{j,k}$. If the
matrix has complex elements and $A^* = A$, where $*$ denotes conjugate
transpose so that $(A^*)_{k,j} = a_{j,k}^*$, then $A$ is said to be
Hermitian.

The product of an $m \times n$ matrix $A$ and an $n$-dimensional vector $x$,
$y = Ax$, is an $m$-dimensional vector with components

$$y_i = \sum_{k=0}^{n-1} a_{i,k} x_k ,$$

that is, the inner product of $x$ and the $i$th row of $A$.

The outer product of two $n$-dimensional vectors $x$ and $y$ is defined as
the $n$ by $n$ matrix

$$x y^t = \begin{pmatrix}
x_0 y_0 & x_0 y_1 & x_0 y_2 & \cdots & x_0 y_{n-1} \\
x_1 y_0 & x_1 y_1 & x_1 y_2 & \cdots & x_1 y_{n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{n-1} y_0 & x_{n-1} y_1 & x_{n-1} y_2 & \cdots & x_{n-1} y_{n-1}
\end{pmatrix} .$$  (A.24)
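
The inner product, matrix-vector product, and outer product defined above translate almost line by line into code. The sketch below uses plain Python lists (rows of a matrix are lists) rather than a numerical library; the function names and the example numbers are illustrative assumptions.

```python
def inner(x, y):
    """Real inner product: sum of x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))


def inner_complex(x, y):
    """Complex inner product: sum of conj(x_i) * y_i."""
    return sum(complex(xi).conjugate() * yi for xi, yi in zip(x, y))


def matvec(A, x):
    """Matrix-vector product: component i is the inner product of row i of A with x."""
    return [inner(row, x) for row in A]


def outer(x, y):
    """Outer product: the matrix whose (i, j) entry is x_i * y_j."""
    return [[xi * yj for yj in y] for xi in x]


A = [[1, 2, 3],
     [4, 5, 6]]          # a 2 x 3 matrix
x = [1, 0, -1]           # a 3-dimensional vector

print(matvec(A, x))              # a 2-dimensional vector
print(outer([1, 2], [3, 4]))     # a 2 x 2 matrix
print(inner_complex([1j, 1], [1j, 2]))
```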

Given a square matrix $A$, a scalar $\lambda$ is called an eigenvalue and a
vector $u$ is called an eigenvector if

$$Au = \lambda u .$$  (A.25)

An $n$ by $n$ matrix has $n$ eigenvalues and eigenvectors, but they need not
be distinct. Eigenvalues provide interesting formulas for two attributes of
matrices, the trace defined by

$$\mathrm{Tr}(A) = \sum_{i=0}^{n-1} a_{i,i}$$

and the determinant of the matrix $\det(A)$:

$$\mathrm{Tr}(A) = \sum_{i=0}^{n-1} \lambda_i$$  (A.26)

$$\det(A) = \prod_{i=0}^{n-1} \lambda_i$$  (A.27)

The arithmetic mean/geometric mean inequality says that the arithmetic
mean is bounded below by the geometric mean: for nonnegative real numbers
$\lambda_i$,

$$\frac{1}{n} \sum_{i=0}^{n-1} \lambda_i \geq \left( \prod_{i=0}^{n-1} \lambda_i \right)^{1/n}$$  (A.28)

with equality if and only if the $\lambda_i$ are all the same. Application
of the inequality to the eigenvalue representation of the determinant and
trace provides the inequality

$$\frac{1}{n} \mathrm{Tr}(A) \geq (\det(A))^{1/n}$$  (A.29)

with equality if and only if the eigenvalues are all the same.

A square Hermitian matrix $A$ can be diagonalized into the form

$$A = U \Lambda U^* ,$$  (A.30)

where $\Lambda$ is the diagonal matrix with diagonal entries
$\Lambda(k, k) = \lambda_k$, the $k$th eigenvalue of the matrix, and where
$U$ is a unitary matrix, that is, $U^* = U^{-1}$.
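
The eigenvalue relations for the trace and determinant, and the resulting inequality (A.29), can be checked by hand for a real symmetric 2 by 2 matrix, where the eigenvalues have a closed form. The sketch below is limited to that special case; the function name `eig_sym_2x2`, the example entries, and the eigenvector formula $u = (b, \lambda - a)$ (valid when the off-diagonal entry $b$ is nonzero) are assumptions chosen for illustration.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues of the real symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


a, b, d = 2.0, 1.0, 2.0            # the matrix [[2, 1], [1, 2]]
lam0, lam1 = eig_sym_2x2(a, b, d)

trace = a + d
det = a * d - b * b

# (A.26) and (A.27): trace is the sum and determinant the product of eigenvalues.
assert math.isclose(lam0 + lam1, trace)
assert math.isclose(lam0 * lam1, det)

# (A.25): for b != 0, u = (b, lam - a) satisfies A u = lam u.
for lam in (lam0, lam1):
    u = (b, lam - a)
    Au = (a * u[0] + b * u[1], b * u[0] + d * u[1])
    assert math.isclose(Au[0], lam * u[0]) and math.isclose(Au[1], lam * u[1])

# (A.29) with n = 2: Tr(A)/2 is at least det(A)^(1/2) for nonnegative eigenvalues.
assert trace / 2.0 >= math.sqrt(det)

print(lam0, lam1)
```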

The inner product and outer product of two vectors can be related as

$$x^t y = \mathrm{Tr}(x y^t) .$$  (A.31)

Given an $n$-dimensional vector $x$ and an $n$ by $n$ matrix $A$, the product

$$x^t A x = \sum_{k=0}^{n-1} \sum_{j=0}^{n-1} x_k a_{k,j} x_j$$

is called a quadratic form. If the matrix $A$ is such that $x^t A x \geq 0$
for all $x$, the matrix is said to be nonnegative definite. If the matrix is
such that $x^t A x > 0$ for all nonzero $x$, then the matrix is said to be
positive definite. These are the definitions for real-valued vectors and
matrices. For complex vectors and matrices use the conjugate transpose
instead of the transpose. If a matrix is positive definite, then its
eigenvalues are all strictly positive and hence so is its determinant.
A quadratic form can also be written as

$$x^t A x = \mathrm{Tr}(A x x^t) .$$  (A.32)
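
The two trace identities, (A.31) for the inner product and (A.32) for the quadratic form, are easy to confirm on small examples. The following sketch uses plain Python lists; the helper names and the particular matrix and vectors are illustrative assumptions.

```python
def inner(x, y):
    """Real inner product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def outer(x, y):
    """Outer product: the matrix with entries x_i * y_j."""
    return [[xi * yj for yj in y] for xi in x]


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


def quad_form(A, x):
    """Quadratic form: the double sum of x_k * a_{k,j} * x_j."""
    n = len(x)
    return sum(x[k] * A[k][j] * x[j] for k in range(n) for j in range(n))


A = [[2, 1],
     [1, 2]]
x = [1, -2]
y = [3, 5]

# (A.31): the inner product equals the trace of the outer product.
assert inner(x, y) == trace(outer(x, y))
# (A.32): the quadratic form equals Tr(A x x^t).
assert quad_form(A, x) == trace(matmul(A, outer(x, x)))

print(inner(x, y), quad_form(A, x))
```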

If a matrix $A$ is positive definite and Hermitian (e.g., real and
symmetric), then its square root $A^{1/2}$ is well-defined as
$U \Lambda^{1/2} U^*$. In particular,

A.5 Linear System Fundamentals

In general, a system $\mathcal{L}$ is a mapping of an input time function or
input signal, $x = \{x(t); t \in \mathcal{T}\}$, into an output time function
or output signal, $\mathcal{L}(x) = y = \{y(t); t \in \mathcal{T}\}$. We now
use $\mathcal{T}$ to denote the index set or domain of definition instead of
$\mathcal{I}$ to emphasize that the members of the set correspond to “time.”
Usually the functions take on real or complex values for each value of time
$t$ in $\mathcal{T}$. The system is called a discrete time system if
$\mathcal{T}$ is discrete; e.g., $\mathcal{Z}$ or $\mathcal{Z}_+$, and it is
called a continuous time system if $\mathcal{T}$ is continuous; e.g., $\Re$
or $[0, \infty)$. If only nonnegative times are allowed, e.g., $\mathcal{T}$
is $\mathcal{Z}_+$ or $[0, \infty)$, the system is called a one-sided or
single-sided system. If time can go on infinitely in both directions, then it
is said to be a two-sided system.

A system $\mathcal{L}$ is said to be linear if the mapping is linear, that
is, for all complex (or real) constants $a$ and $b$ and all input functions
$x_1$ and $x_2$

$$\mathcal{L}(a x_1 + b x_2) = a \mathcal{L}(x_1) + b \mathcal{L}(x_2) .$$  (A.21)
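
The linearity condition can be probed numerically on finite sequences. The sketch below is an illustration under several assumptions: the two example systems (`moving_average` and `squarer`), the test signals, and the tolerance are all made up for the purpose. Passing the test for one pair of inputs does not prove linearity, but a single failure does disprove it.

```python
def moving_average(x):
    """y[n] = (x[n] + x[n-1]) / 2, with x[-1] taken to be 0 (a linear system)."""
    return [(x[n] + (x[n - 1] if n > 0 else 0.0)) / 2.0 for n in range(len(x))]


def squarer(x):
    """y[n] = x[n]^2 (not a linear system)."""
    return [v * v for v in x]


def is_linear_on(system, x1, x2, a, b, tol=1e-12):
    """Compare L(a x1 + b x2) with a L(x1) + b L(x2) for one choice of inputs."""
    lhs = system([a * u + b * v for u, v in zip(x1, x2)])
    rhs = [a * u + b * v for u, v in zip(system(x1), system(x2))]
    return all(abs(p - q) <= tol for p, q in zip(lhs, rhs))


x1 = [1.0, 2.0, -1.0, 0.5]
x2 = [0.0, 1.0, 4.0, -2.0]
a, b = 2.0, -3.0

print(is_linear_on(moving_average, x1, x2, a, b))
print(is_linear_on(squarer, x1, x2, a, b))
```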

There are many ways to define or describe a particular linear system: 
one can provide a constructive rule for determining the output from the 
input; e.g., the output may be a weighted sum or integral of values of the 
input. Alternatively, one may provide a set of equations whose solution 
determines the output from the input, e.g., differential or difference equa- 
tions involving the input and output at various times. Our emphasis will 
be on the former constructive technique, but we shall occasionally consider 
examples of other techniques. 

The most common and the most useful class of linear systems comprises 
systems that can be represented by a convolution, that is, where the output 
is described by a weighted integral or sum of input values. We first consider 
continuous time systems and then turn to discrete time systems. 

For $t \in \mathcal{T} \subset \Re$, let $x(t)$ be a continuous time input to
a system with output $y(t)$ defined by the superposition integral

$$y(t) = \int_{s:\, t-s \in \mathcal{T}} x(t - s) h_t(s)\, ds .$$  (A.22)

The function $h_t(s)$ is called the impulse response of the system since it
can be considered the output of the system at time $t$ which results from an
input of a unit impulse or Dirac delta function $x(t) = \delta(t)$ at time 0.
The index set is usually either $(-\infty, \infty)$ or $[0, \infty)$ for
continuous time systems. The linearity of integration implies that the system
defined by (A.22) is a linear system. A system of this type is called a
linear filter. If the impulse response does not depend on time $t$, then the
filter is said to be time-invariant and the superposition integral becomes a
convolution integral:

$$y(t) = \int_{s:\, t-s \in \mathcal{T}} x(t - s) h(s)\, ds = \int_{s \in \mathcal{T}} x(s) h(t - s)\, ds .$$  (A.23)

We shall deal almost exclusively with time-invariant filters. Such a linear
time-invariant system is often depicted using a block diagram as in Figure
A.3.

Figure A.3: Linear Filter (input $x(t)$; output given by the convolution integral (A.23))

Figure A.4: Cascade Filter (input $x(t)$ with transform $X(f)$; first filter $H_1(f)$; intermediate output with transform $Y(f)$; second filter $H_2(f)$; final output $z(t)$ with transform $Z(f)$)

If $x(t)$ and $h(t)$ are absolutely integrable, i.e.,

$$\int_{\mathcal{T}} |x(t)|\, dt , \ \int_{\mathcal{T}} |h(t)|\, dt < \infty ,$$  (A.24)

then their Fourier transforms exist:

$$X(f) = \int_{\mathcal{T}} x(t) e^{-j 2 \pi f t}\, dt , \quad H(f) = \int_{\mathcal{T}} h(t) e^{-j 2 \pi f t}\, dt .$$  (A.25)

Continuous time filters satisfying (A.24) are said to be stable. $H(f)$ is
called the filter transfer function or the system function. We point out
that (A.24) is a sufficient but not necessary condition for the existence of
the transform. We shall not usually be concerned with the fine points of
the existence of such transforms and their inverses. The inverse transforms
that we require will be accomplished either by inspection or by reference
to a table.

A basic property of Fourier transforms is that convolution in the time
domain corresponds to multiplication in the frequency domain, and hence
the output transform is given by

$$Y(f) = H(f) X(f) .$$  (A.26)

Even if a particular system has an input that does not have a Fourier
transform, (A.26) can be used to find the transfer function of the system
by using some other input that does have a Fourier transform.

As an example, consider Figure A.4, where two linear filters are
concatenated or cascaded: $x(t)$ is input to the first filter, and the output
$y(t)$ is input to the second filter, with final output $z(t)$. If both
filters are stable and $x(t)$ is absolutely integrable, the Fourier
transforms satisfy

$$Y(f) = H_1(f) X(f) , \quad Z(f) = H_2(f) Y(f) ,$$  (A.27)

or

$$Z(f) = H_2(f) H_1(f) X(f) .$$

Obviously the overall filter transfer function is $H(f) = H_2(f) H_1(f)$.
The overall impulse response is then the inverse transform of $H(f)$.

Frequently (but not necessarily) the output of a linear filter can also be
represented by a finite order differential equation in terms of the
differential operator, $D = d/dt$:

$$\sum_{k=0}^{n} a_k D^k y(t) = \sum_{i=0}^{m} b_i D^i x(t) .$$  (A.28)

The output is completely specified by the input, the differential equation,
and appropriate initial conditions. Under suitable conditions on the
differential equation, the linear filter is stable, and the transfer function
can be obtained by transforming both sides of (A.28). However, we shall not
pursue this approach further.

Turn now to Figure A.5. Here we show an idealized sampled data system to
demonstrate the relationship between discrete and continuous time
filters. The input function $x(t)$ is input to a mixer, which forms the
product of $x(t)$ with a pulse train,
$p(t) = \sum_{k \in \mathcal{T}} \delta(t - k)$, of Dirac delta functions
spaced one second apart in time. $\mathcal{T}$ is a suitable subset of
$\mathcal{Z}$. If we denote the sampled values $x(k)$ by $x_k$, the product is

$$x(t) p(t) = \sum_{k} x_k \delta(t - k) ,$$

which is the input to a linear filter with impulse response $h(t)$. Applying
the convolution integral of equation (A.23) and sampling the output with
a switch at one-second intervals, we have as an output function at time $n$,
writing $h_k = h(k)$,

$$y_n = y(n) = \int x(t) p(t) h(n - t)\, dt$$
$$= \int \sum_{k} x_k \delta(t - k) h(n - t)\, dt$$
$$= \sum_{k:\, k \in \mathcal{T}} x_k h_{n-k}$$
$$= \sum_{k:\, n-k \in \mathcal{T}} x_{n-k} h_k .$$  (A.29)



Figure A.5: Sampled Data System (input $x(t)$; pulse train $p(t)$; mixer output $x(t)p(t)$; filter output $y(t)$)

Thus, macroscopically the filter is a discrete time linear filter with a
discrete convolution sum in place of an integral. $\{h_k\}$ is called the
Kronecker $\delta$ response of the discrete time filter. Its name is derived
from the fact that $h_k$ is the output of the linear filter at time $k$ when
a Kronecker delta function is input at time zero. It is also sometimes
referred to as the “discrete time impulse response” or the “unit pulse
response.” If only a finite number of the $h_k$ are nonzero, then the filter
is sometimes referred to as an FIR (finite impulse response) filter. If a
filter is not an FIR filter, then it is an IIR (infinite impulse response)
filter.
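
The discrete convolution sum can be written out directly for an FIR filter. The sketch below assumes finite, one-sided sequences whose first entry corresponds to time 0; the function name `convolve` and the example coefficients are illustrative. Feeding in a Kronecker delta returns the sequence $\{h_k\}$ itself, which is the reason for the name “Kronecker delta response.”

```python
def convolve(x, h):
    """Discrete convolution y[n] = sum_k x[k] h[n-k] of two finite sequences.

    Both sequences are assumed to start at time 0 and to be zero elsewhere.
    """
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm
    return y


h = [1.0, 0.5, 0.25]    # an FIR filter with three nonzero coefficients
delta = [1.0]           # Kronecker delta: 1 at time 0, zero otherwise
x = [1.0, 2.0, 3.0]

print(convolve(delta, h))   # reproduces h
print(convolve(x, h))
```

Swapping the two arguments gives the same output, matching the two equivalent forms of the sum in (A.29).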

If $\{h_k\}$ and $\{x_k\}$ are both absolutely summable,

$$\sum_{k} |h_k| < \infty , \quad \sum_{k} |x_k| < \infty ,$$  (A.30)

then their discrete Fourier transforms exist:

$$H(f) = \sum_{k} h_k e^{-j 2 \pi k f} , \quad X(f) = \sum_{k} x_k e^{-j 2 \pi k f} .$$  (A.31)

Discrete time filters satisfying (A.30) are said to be stable. $H(f)$ is
called the filter transfer function. The output transform is given by

$$Y(f) = H(f) X(f) .$$  (A.32)

The example of Figure A.4 applies for discrete time as well as continuous
time.
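
For finite sequences the transform sums have finitely many terms, so the product rule for the output transform and the cascade rule can both be checked numerically. The sketch below assumes one-sided finite sequences starting at time 0; the function names `dtft` and `convolve`, the filter coefficients, and the test frequencies are illustrative assumptions.

```python
import cmath


def dtft(seq, f):
    """Transform sum over k of seq[k] * exp(-j 2 pi k f) for a finite sequence."""
    return sum(v * cmath.exp(-2j * cmath.pi * k * f) for k, v in enumerate(seq))


def convolve(x, h):
    """Discrete convolution of two finite sequences starting at time 0."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm
    return y


h1 = [1.0, 0.5]          # first filter in the cascade
h2 = [1.0, -1.0]         # second filter in the cascade
x = [1.0, 2.0, 3.0]      # input

y = convolve(x, h1)      # output of the first filter
z = convolve(y, h2)      # output of the second filter
h = convolve(h1, h2)     # overall response of the cascade

for f in (-0.4, -0.1, 0.0, 0.25, 0.5):
    # Output transform equals transfer function times input transform.
    assert abs(dtft(y, f) - dtft(h1, f) * dtft(x, f)) < 1e-9
    # Cascade: Z(f) = H2(f) H1(f) X(f).
    assert abs(dtft(z, f) - dtft(h2, f) * dtft(h1, f) * dtft(x, f)) < 1e-9

print(h, z)
```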

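For convenience and brevity, we shall occasionally use a general notation
$\mathcal{F}$ to denote both the discrete and continuous Fourier transforms;
that is,

$$\mathcal{F}(x) = \begin{cases}
\int_{\mathcal{T}} x(t) e^{-j 2 \pi f t}\, dt , & \mathcal{T} \text{ continuous,} \\
\sum_{k \in \mathcal{T}} x_k e^{-j 2 \pi f k} , & \mathcal{T} \text{ discrete.}
\end{cases}$$  (A.33)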



A more general discrete time linear system is described by a difference
equation of the form

$$\sum_{k} a_k y_{n-k} = \sum_{i} b_i x_{n-i} .$$  (A.34)

Observe that the convolution of (A.29) is a special case of the above where
only one of the $a_k$ is not zero. Observe also that the difference equation
(A.34) is a discrete time analog of the differential equation (A.28). As
in that case, to describe an output completely one has to specify initial
conditions.
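
A difference equation of this form can be run forward in time once initial conditions are fixed. The sketch below makes several assumptions that are not in the text: the indices $k$ and $i$ run over $0, 1, 2, \ldots$, the coefficient $a_0$ is nonzero, the input starts at time 0, and all outputs before time 0 are taken to be zero. The function name `difference_equation` is illustrative.

```python
def difference_equation(a, b, x):
    """Solve sum_k a[k] y[n-k] = sum_i b[i] x[n-i] for y[0], ..., y[len(x)-1].

    Assumes a[0] != 0 and zero initial conditions (x and y vanish before time 0).
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


delta = [1.0, 0.0, 0.0, 0.0, 0.0]   # Kronecker delta input

# y[n] - 0.5 y[n-1] = x[n]: the delta response is 0.5**n, an IIR filter.
print(difference_equation([1.0, -0.5], [1.0], delta))

# With only a[0] nonzero the recursion reduces to a convolution with b.
print(difference_equation([1.0], [1.0, 0.5, 0.25], [1.0, 2.0, 3.0]))
```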

A continuous time or discrete time filter is said to be causal if the pulse
response or impulse response is zero for negative time; that is, if a discrete
time pulse response $h_k$ satisfies $h_k = 0$ for $k < 0$ or a continuous
time impulse response $h(t)$ satisfies $h(t) = 0$ for $t < 0$.



A.6 Problems

1. Use the first seven relations to prove relations (A.10), (A.13), and
(A.16).

2. Use relation (A.16) to obtain a partition $\{G_i; i = 1, 2, \ldots, k\}$
of $\Omega$ from an arbitrary finite class of collectively exhaustive sets
$\{F_i; i = 1, 2, \ldots, k\}$ with the property that $G_i \subset F_i$ for
all $i$ and

$$\bigcup_{j=1}^{i} G_j = \bigcup_{j=1}^{i} F_j \quad \text{all } i .$$

Repeat for a countable collection of sets $\{F_i\}$. (You must prove that
the given collection of sets is indeed a partition.)

3. If $\{F_i\}$ partitions $\Omega$, show that $\{G \cap F_i\}$ partitions $G$.

4. Show that $F \subset G$ implies that $F \cap G = F$, $F \cup G = G$, and
$G^c \subset F^c$.

5. Show that if $F$ and $G$ are disjoint, then $F \subset G^c$.

6. Show that $F \cap G = (F \cup G) - (F \Delta G)$.

7. Let $F_r = [0, 1/r)$, $r \in (0, 1]$. Find $\bigcup_{r \in (0,1]} F_r$ and
$\bigcap_{r \in (0,1]} F_r$.

8. Prove the countably infinite version of DeMorgan’s “laws.” For example,
given a sequence of sets $F_i$; $i = 1, 2, \ldots$, then

$$\bigcap_{i=1}^{\infty} F_i = \left( \bigcup_{i=1}^{\infty} F_i^c \right)^c .$$

9. Define the subsets of the real line

$$F_n = \left\{ r : |r| > \frac{1}{n} \right\} ,$$

and

$$F = \{0\} .$$

Show that

$$F^c = \bigcup_{n=1}^{\infty} F_n .$$

10. Let $F_i$, $i = 1, 2, \ldots$ be a countable sequence of “nested” closed
intervals whose length is not zero, but tends to zero; i.e., for every $i$,
$F_i = [a_i, b_i] \subset F_{i-1} \subset F_{i-2} \subset \cdots$ and
$b_i - a_i \to 0$ as $i \to \infty$. What are the points in
$\bigcap_{i=1}^{\infty} F_i$?

11. Prove that the interval $[0, 1]$ cannot be put into one-to-one
correspondence with the set of integers as follows: Suppose that there is such
a correspondence so that $x_1, x_2, x_3, \ldots$ is a listing of all numbers
in $[0, 1]$. Use Problem 10 to construct a set that consists of a point not
in this listing. This contradiction proves the statement.

12. Show that inverse images preserve set theoretic operations, that is,
given $f : \Omega \to A$ and sets $F$ and $G$ in $A$, then

$$f^{-1}(F^c) = (f^{-1}(F))^c ,$$

$$f^{-1}(F \cup G) = f^{-1}(F) \cup f^{-1}(G) ,$$

and

$$f^{-1}(F \cap G) = f^{-1}(F) \cap f^{-1}(G) .$$

If $\{F_i, i \in \mathcal{I}\}$ is an indexed family of subsets of $A$ that
partitions $A$, show that $\{f^{-1}(F_i); i \in \mathcal{I}\}$ is a partition
of $\Omega$. Do images preserve set theoretic operations in general? (Prove
that they do or provide a counterexample.)

13. An experiment consists of rolling two four-sided dice (each having
faces labeled 1, 2, 3, 4) on a glass table. Depict the space $\Omega$ of
possible outcomes. Define two functions on $\Omega$: $X_1(\omega)$ = the sum
of the two down faces and $X_2(\omega)$ = the product of the two down
faces. Let $A_1$ denote the range space of $X_1$, $A_2$ the range space of
$X_2$, and $A_{12}$ the range space of the vector-valued function
$X = (X_1, X_2)$, that is, $X(\omega) = (X_1(\omega), X_2(\omega))$. Draw in
both $\Omega$ and $A_{12}$ the set $\{\omega : X_1(\omega) < X_2(\omega)\}$.
The cartesian product $\prod_{i=1}^{2} A_i$ of two sets is defined as the
collection of all pairs of elements, one from each set, that is

$$\prod_{i=1}^{2} A_i = \{\text{all } a, b : a \in A_1, b \in A_2\} .$$

Is it true above that $A_{12} = \prod_{i=1}^{2} A_i$?

14. Let $\Omega = [0, 1]$ and $A$ be the set of all infinite binary vectors.
Find a one-to-one mapping from $\Omega$ to $A$, being careful to note that
some rational numbers have two infinite binary representations (e.g.,
$1/2 = .1000\ldots = .0111\ldots$ in binary).

15. Can you find a one-to-one mapping from:

(a) $[0, 1]$ to $[0, 2)$?

(b) $[0, 1]$ to the unit square in two-dimensional Euclidean space?

(c) $\mathcal{Z}$ to $\mathcal{Z}_+$? When is it possible to find a
one-to-one mapping from one space to another?

16. Suppose that a voltage is measured that takes values in
$\Omega = [0, 15]$. The voltage is mapped into the finite space
$A = \{0, 1, \ldots, 15\}$ for transmission over a digital channel. A mapping
of this type is called a quantizer. What is the best mapping in the sense
that the maximum error is minimized?

17. Let $A$ be as in Problem 16, i.e., the space of 16 messages which is
mapped into the space of 16 waveforms,
$B = \{\cos nt, n = 0, 1, \ldots, 15; t \in [0, 2\pi]\}$. The selected
waveform from $B$ is transmitted on a waveform channel, which adds noise;
i.e., $B$ is mapped into $C$ = {set of all possible waveforms
$\{y(t) = \cos nt + \text{noise}(t); t \in [0, 2\pi]\}$}. (This is a
random mapping in a sense that will be described in subsequent chapters.)
Find a good mapping from $C$ into $D = A$. $D$ is the decision space and the
mapping is called a decision rule. (In other words, how would you perform
this mapping knowing little of probability theory? Your mapping should at
least give the correct decision if the noise is absent or small.)

18. Given a continuous time linear filter with impulse response $h(t)$ given
by $e^{-at}$ for $t \geq 0$ and 0 for $t < 0$, where $a$ is a positive
constant, find the transfer function $H(f)$ of the filter. Is the filter
stable? What happens if $a = 0$?



19. Given a discrete time linear filter with pulse response $h_k$ given by
$r^k$ for $k \geq 0$ and 0 for $k < 0$, where $r$ has magnitude strictly less
than 1, find the transfer function $H(f)$. (Hint: Use the geometric
series formula.) Is the filter stable? What happens if $r = 1$? Assume
that $|r| < 1$. Suppose that the input $x_k = 1$ for all nonnegative $k$
and $x_k = 0$ for all negative $k$ is put into the filter. Find a simple
expression for the output as a function of time. Does the transform
of the output exist?

20. A continuous time system is described by the following relation: Given
an input $x = \{x(t); t \in \Re\}$, the output $y = \{y(t); t \in \Re\}$ is
defined for each $t$ by

$$y(t) = (a_0 + a_1 x(t)) \cos(2 \pi f_0 t + \theta) ,$$

where $a_0$, $a_1$, $f_0$, and $\theta$ are fixed parameters. (This system is
called an amplitude modulation (AM) system.) Under what conditions on
the parameters is this system linear? Is it time-invariant?

21. Suppose that $x = \{x(t); t \in \mathcal{R}\}$, where
$\mathcal{R} = (-\infty, \infty)$ is the real line, a continuous time signal
defined by

$$x(t) = \begin{cases} 1 & |t| < T \\ 0 & \text{otherwise,} \end{cases}$$

where $T > 0$ is a fixed parameter, is put into a linear, time-invariant
(LTI) filter described by an impulse response
$h = \{h(t); t \in \mathcal{R}\}$, where

$$h(t) = \begin{cases} e^{-t} & t \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the Fourier transform $X$ of $x$, i.e.,

$$X(f) = \int_{-\infty}^{\infty} x(t) e^{-j 2 \pi f t}\, dt ; \quad f \in \mathcal{R} ,$$

where $j = \sqrt{-1}$. Find the Fourier transform $H$ of $h$.

(b) Find $y$, the output signal of the LTI filter, and its Fourier
transform $Y$.

22. Suppose that $x = \{x_n; n \in \mathcal{Z}\}$, where $\mathcal{Z}$ is the
set of all integers $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$, a discrete time
signal defined by

$$x_n = \begin{cases} r^n & n \geq 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $r$ is a fixed parameter satisfying $|r| < 1$, is put into a linear,
time-invariant (LTI) filter described by a Kronecker delta response
$h = \{h_n; n \in \mathcal{Z}\}$, where

$$h_n = \begin{cases} 1 & n = 0, 1, \ldots, N-1 \\ 0 & \text{otherwise,} \end{cases}$$

where $N > 0$ is a fixed integer. This filter is sometimes called a “comb
filter.” Note that the Kronecker delta response is the response of the
filter when the input is the Kronecker delta (defined as 1 for $n = 0$
and zero otherwise).

(a) Find the (discrete-time) Fourier transform $X$ of $x$, i.e.,

$$X(f) = \sum_{n=-\infty}^{\infty} x_n e^{-j 2 \pi f n} ; \quad f \in (-1/2, 1/2) .$$

Find the Fourier transform $H$ of $h$.

(b) Find $y$, the output signal of the LTI filter, and its Fourier
transform $Y$.

23. Look up or derive the formula for the sum of a geometric progression

$$\sum_{k=0}^{n} r^k .$$

Prove that the formula is true. Repeat for the sum

$$\sum_{k=0}^{\infty} r^k$$

under the assumption that $|r| < 1$.

24. Evaluate the following integrals:

(a)

25. Evaluate the following integrals:

(a)

(b)

(c)

Appendix B

Sums and Integrals



In this appendix a few useful definitions and results are gathered for
reference.



B.1 Summation



The sum of consecutive integers.

$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$$  (B.1)

Proof: The result is easily proved by induction, which requires
demonstrating the truth of the formula for $n = 1$ (which is obvious) and
showing that if the formula is true for any positive integer $n$, then it
must also be true for $n + 1$. This follows since if
$S_n = \sum_{k=1}^{n} k$ and we assume that $S_n = n(n+1)/2$, then
necessarily

$$S_{n+1} = S_n + (n + 1) = (n + 1)\left( \frac{n}{2} + 1 \right) = \frac{(n+1)(n+2)}{2} ,$$

proving the claim.


The sum of consecutive squares of integers.

$$\sum_{k=1}^{n} k^2 = \frac{1}{3} n^3 + \frac{1}{2} n^2 + \frac{1}{6} n \tag{B.2}$$

The sum can also be expressed as

$$\sum_{k=1}^{n} k^2 = \frac{(2n+1)(n+1)n}{6}.$$

Proof: This can also be proved by induction, but for practice we note
another approach. Just as in solving differential or difference equations,
one can guess a general form of solution and solve for unknowns. Since
summing $k$ up to $n$ had a second order solution in $n$, one might suspect
that solving for a sum up to $n$ of squares of $k$ would have a third order
solution in $n$, that is, a solution of the form $f(n) = an^3 + bn^2 + cn + d$ for
some real numbers $a, b, c, d$. Assume for the moment that this is the case.
Then if $f(n) = \sum_{k=1}^{n} k^2$, clearly $n^2 = f(n) - f(n-1)$ and hence with a
little algebra

$$\begin{aligned}
n^2 &= an^3 + bn^2 + cn + d - \left[ a(n-1)^3 + b(n-1)^2 + c(n-1) + d \right] \\
    &= 3an^2 + (2b - 3a)n + (a - b + c).
\end{aligned}$$

This can only be true for all $n$, however, if $3a = 1$ so that $a = 1/3$, if
$2b - 3a = 0$ so that $b = 3a/2 = 1/2$, and if $a - b + c = 0$ so that
$c = b - a = 1/6$. This leaves $d$, but the initial condition that $f(1) = 1$
implies $d = 0$.
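
As a quick sanity check on the two summation formulas, the following Python sketch compares the closed forms with brute-force sums and solves the coefficient conditions from the proof exactly. The function names are ours and purely illustrative.

```python
from fractions import Fraction

def sum_integers(n):
    """Closed form (B.1) for 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def sum_squares(n):
    """Closed form (2n+1)(n+1)n/6 for 1^2 + 2^2 + ... + n^2."""
    return (2 * n + 1) * (n + 1) * n // 6

def cubic_coefficients():
    """Solve 3a = 1, 2b - 3a = 0, a - b + c = 0 and f(1) = 1 exactly."""
    a = Fraction(1, 3)
    b = 3 * a / 2
    c = b - a
    d = 1 - (a + b + c)   # f(1) = a + b + c + d = 1
    return a, b, c, d

# Compare both closed forms with direct summation.
for n in range(1, 101):
    assert sum_integers(n) == sum(range(1, n + 1))
    assert sum_squares(n) == sum(k * k for k in range(1, n + 1))

print(cubic_coefficients())
```

Exact rational arithmetic is used for the coefficients so that the check is not blurred by floating-point rounding.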



The geometric progression

Given a complex constant $a \ne 1$,

$$\sum_{k=0}^{n-1} a^k = \frac{1 - a^n}{1 - a}, \tag{B.3}$$

and if $|a| < 1$ this sum is convergent and

$$\sum_{k=0}^{\infty} a^k = \frac{1}{1 - a}. \tag{B.4}$$

Proof: There are, in fact, many ways to prove this result. Perhaps the
simplest is to define the sum with $n$ terms $S_n = \sum_{k=0}^{n-1} a^k$ and observe
that

$$\begin{aligned}
(1 - a) S_n &= \sum_{k=0}^{n-1} a^k - a \sum_{k=0}^{n-1} a^k \\
            &= \sum_{k=0}^{n-1} a^k - \sum_{k=1}^{n} a^k \\
            &= 1 - a^n,
\end{aligned}$$

proving (B.3). Other methods of proof include induction and solving the
difference equation $S_n = S_{n-1} + a^{n-1}$. Proving the finite $n$ result gives
the infinite sum since if $|a| < 1$,

$$\sum_{k=0}^{\infty} a^k = \lim_{n \to \infty} S_n = \frac{1}{1 - a}.$$

For the reader who might be rusty with limiting arguments, this follows
since

$$\left| S_n - \frac{1}{1-a} \right| = \left| \frac{a^n}{1-a} \right| = \frac{|a|^n}{|1-a|} \to 0$$

as $n \to \infty$ since by assumption $|a| < 1$.
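
The finite sum formula and the limiting argument can be checked numerically. The sketch below uses an arbitrarily chosen complex constant with magnitude less than 1 and compares the distance between the partial sum and the limit with the bound $|a|^n/|1-a|$ from the argument above; the names are illustrative only.

```python
def geometric_partial_sum(a, n):
    """Closed form for the sum of a**k, k = 0..n-1, valid for a != 1."""
    return (1 - a ** n) / (1 - a)

a = 0.5 + 0.4j                      # arbitrary complex constant with |a| < 1

# The closed form agrees with term-by-term summation.
for n in (1, 5, 20):
    brute = sum(a ** k for k in range(n))
    assert abs(brute - geometric_partial_sum(a, n)) < 1e-12

# Distance to the infinite sum equals |a|**n / |1 - a| and shrinks with n.
limit = 1 / (1 - a)
errors = [abs(geometric_partial_sum(a, n) - limit) for n in (10, 20, 40)]
bounds = [abs(a) ** n / abs(1 - a) for n in (10, 20, 40)]
print(errors)
print(bounds)
```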

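First moment of the geometric progression

Given $q \in (0, 1)$,

$$\sum_{k=0}^{\infty} k q^{k-1} = \frac{1}{(1-q)^2}.$$

Proof: Since $k q^{k-1} = \frac{d}{dq} q^k$ and since we can interchange
differentiation and summation,

$$\sum_{k=0}^{\infty} k q^{k-1} = \frac{d}{dq} \sum_{k=0}^{\infty} q^k = \frac{d}{dq} \frac{1}{1-q} = \frac{1}{(1-q)^2},$$

where we have used the geometric series sum formula.

Second moment of the geometric progression

Given $q \in (0, 1)$,

$$\sum_{k=0}^{\infty} k^2 q^{k-1} = \frac{2q}{(1-q)^3} + \frac{1}{(1-q)^2}.$$

Proof: Take a second derivative of a geometric progression to find

$$\begin{aligned}
\frac{d^2}{dq^2} \sum_{k=0}^{\infty} q^k &= \sum_{k=0}^{\infty} k(k-1) q^{k-2} \\
 &= \frac{1}{q} \sum_{k=0}^{\infty} k^2 q^{k-1} - \frac{1}{q} \sum_{k=0}^{\infty} k q^{k-1} \\
 &= \frac{1}{q} \sum_{k=0}^{\infty} k^2 q^{k-1} - \frac{1}{q(1-q)^2}
\end{aligned}$$

and

$$\frac{d^2}{dq^2} \frac{1}{1-q} = \frac{2}{(1-q)^3}$$

so that

$$\sum_{k=0}^{\infty} k^2 q^{k-1} = \frac{2q}{(1-q)^3} + \frac{1}{(1-q)^2},$$

proving the claim.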
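
Both moment formulas are easy to confirm by truncating the series. In the sketch below the truncation length `terms` and the value of `q` are arbitrary choices of ours; the neglected tail is far below floating-point resolution for the value used.

```python
def first_moment(q, terms=2000):
    """Truncated sum of k * q**(k-1); the k = 0 term is zero and is skipped."""
    return sum(k * q ** (k - 1) for k in range(1, terms))

def second_moment(q, terms=2000):
    """Truncated sum of k**2 * q**(k-1)."""
    return sum(k * k * q ** (k - 1) for k in range(1, terms))

def first_moment_closed(q):
    return 1 / (1 - q) ** 2

def second_moment_closed(q):
    return 2 * q / (1 - q) ** 3 + 1 / (1 - q) ** 2

q = 0.9
print(first_moment(q), first_moment_closed(q))
print(second_moment(q), second_moment_closed(q))
```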



B.2 Double Sums

The following lemma provides a useful simplification of a double summation
that crops up when considering sample averages and laws of large numbers.

Lemma B.1 Given a sequence $\{a_n\}$,

$$\sum_{k=0}^{N-1} \sum_{l=0}^{N-1} a_{k-l} = \sum_{n=-N+1}^{N-1} (N - |n|) a_n.$$

Proof: This result can be thought of in terms of summing the entries of
a matrix $A = \{A_{k,l};\ k, l \in \mathcal{Z}_N\}$ which has the property that all elements
along any diagonal are equal, i.e., $A_{k,l} = a_{k-l}$ for some sequence $a$. (As
mentioned in the text, a matrix of this type is called a Toeplitz matrix.) To
sum up all of the elements in the matrix note that the main diagonal has
$N$ equal values of $a_0$, the next diagonal up has $N - 1$ values of $a_1$, and so
on with the $n$th diagonal having $N - n$ equal values of $a_n$. Note there is
only one element $a_{N-1}$ in the top diagonal. The diagonals on the other
side of the main diagonal are counted in the same way for negative $n$, so
that $a_n$ appears $N - |n|$ times in all.
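
The lemma can be verified directly for small $N$. The following sketch uses a made-up integer sequence that is deliberately not symmetric in $n$, so both sides can be compared exactly; the sequence and function names are assumptions for illustration.

```python
def a(n):
    """Hypothetical test sequence; deliberately not symmetric in n."""
    return n * n - 3 * n + 1

def double_sum(seq, N):
    """Left side of Lemma B.1: sum of seq(k - l) over the N x N index square."""
    return sum(seq(k - l) for k in range(N) for l in range(N))

def diagonal_sum(seq, N):
    """Right side of Lemma B.1: each diagonal value weighted by its length."""
    return sum((N - abs(n)) * seq(n) for n in range(-N + 1, N))

for N in range(1, 9):
    assert double_sum(a, N) == diagonal_sum(a, N)
print(double_sum(a, 4), diagonal_sum(a, 4))
```

The single sum needs about $2N$ evaluations where the double sum needs $N^2$, which is the practical appeal of the lemma.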

The next result is a limiting result for sums of the type considered in
the previous lemma.

Lemma B.2 Suppose that $\{a_n;\ n \in \mathcal{Z}\}$ is an absolutely summable
sequence, i.e., that

$$\sum_{n=-\infty}^{\infty} |a_n| < \infty.$$

Then

$$\lim_{N \to \infty} \sum_{n=-N+1}^{N-1} \left( 1 - \frac{|n|}{N} \right) a_n = \sum_{n=-\infty}^{\infty} a_n.$$

Comment: The limit should be believable since the multiplier in the
summand tends to 1 for each fixed $n$ as $N \to \infty$.

Proof: Absolute summability implies that the infinite sum exists and

$$\sum_{n=-\infty}^{\infty} a_n = \lim_{N \to \infty} \sum_{n=-N+1}^{N-1} a_n,$$

so the result will follow if we show that

$$\lim_{N \to \infty} \sum_{n=-N+1}^{N-1} \frac{|n|}{N} |a_n| = 0.$$

Since the sequence is absolutely summable, given an arbitrarily small
$\epsilon > 0$ we can choose an $N_0$ large enough to ensure that for any
$N \ge N_0$ we have

$$\sum_{n: |n| \ge N} |a_n| < \epsilon.$$

For any $N > N_0$ we can then write

$$\begin{aligned}
\sum_{n=-N+1}^{N-1} \frac{|n|}{N} |a_n|
 &= \sum_{n: |n| \le N_0 - 1} \frac{|n|}{N} |a_n| + \sum_{n: N_0 \le |n| \le N-1} \frac{|n|}{N} |a_n| \\
 &\le \sum_{n: |n| \le N_0 - 1} \frac{|n|}{N} |a_n| + \sum_{n: |n| \ge N_0} |a_n| \\
 &\le \sum_{n: |n| \le N_0 - 1} \frac{|n|}{N} |a_n| + \epsilon.
\end{aligned}$$

Letting $N \to \infty$ the remaining term can be made arbitrarily small, proving
the result.
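
To see the lemma at work, consider the absolutely summable sequence $a_n = r^{|n|}$ with $|r| < 1$, whose two-sided sum is $(1+r)/(1-r)$ by the geometric series formula. This particular sequence and the value $r = 0.5$ are our own choices for illustration. The sketch below evaluates the weighted sum for increasing $N$ and prints the remaining gap.

```python
def a(n, r=0.5):
    """Example absolutely summable sequence a_n = r**|n| (an assumed test case)."""
    return r ** abs(n)

def weighted_sum(seq, N):
    """Sum of (1 - |n|/N) * seq(n) for n = -N+1, ..., N-1."""
    return sum((1 - abs(n) / N) * seq(n) for n in range(-N + 1, N))

full_sum = (1 + 0.5) / (1 - 0.5)    # two-sided geometric sum, equal to 3

gaps = []
for N in (10, 100, 1000, 10000):
    gap = full_sum - weighted_sum(a, N)
    gaps.append(gap)
    print(N, weighted_sum(a, N), gap)
```

For this sequence the gap shrinks roughly like $1/N$, consistent with the multiplier $|n|/N$ in the proof.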



B.3 Integration

A basic integral in calculus and engineering is the simple integral of an
exponential, which corresponds to the sum of a "discrete time exponential,"
a geometric progression. This integral is most easily stated as

$$\int_0^{\infty} e^{-r}\, dr = 1. \tag{B.8}$$

If $a > 0$, then making a linear change of variables as $r = ax$ or $x = r/a$
implies that $dr = a\, dx$ and hence

$$\int_0^{\infty} e^{-ax}\, dx = \frac{1}{a}. \tag{B.9}$$

Integrals of the form

$$\int_0^{\infty} x^k e^{-ax}\, dx$$

can be evaluated by parts, or by using the same trick that worked for the
geometric progression. Take the $k$th derivative of both sides of (B.9) with
respect to $a$:

$$\frac{d^k}{da^k} \int_0^{\infty} e^{-ax}\, dx = \int_0^{\infty} (-x)^k e^{-ax}\, dx = \frac{d^k}{da^k} a^{-1} = (-1)^k k!\, a^{-k-1},$$

so that

$$\int_0^{\infty} x^k e^{-ax}\, dx = \frac{k!}{a^{k+1}}. \tag{B.10}$$
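
Differentiating (B.9) $k$ times with respect to $a$ gives $\int_0^\infty x^k e^{-ax}\,dx = k!/a^{k+1}$, which can be checked numerically. The sketch below uses a hand-written composite Simpson rule and truncates the infinite range at a finite upper limit; the helper names, the truncation point, and the step count are assumptions made for illustration.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def moment_integral(k, a, upper=60.0):
    """Approximate the integral of x**k * exp(-a*x) over [0, infinity),
    truncated at `upper`, where the integrand is negligible for the a used here."""
    return simpson(lambda x: x ** k * math.exp(-a * x), 0.0, upper)

a = 2.0
for k in range(5):
    exact = math.factorial(k) / a ** (k + 1)
    print(k, moment_integral(k, a), exact)
```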



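Computations using a Gaussian pdf follow from the basic integral

$$I = \int_{-\infty}^{\infty} e^{-x^2}\, dx.$$

This integral is a bit trickier than the others considered. It can of course
be found in a book of tables, but again a proof is provided to make it seem
a bit less mysterious. The proof is not difficult, but the initial step may
appear devious. Simplify things by considering the integral

$$\frac{I}{2} = \int_0^{\infty} e^{-x^2}\, dx$$

and note that this one dimensional integral can also be written as a two
dimensional integral:

$$\frac{I^2}{4} = \int_0^{\infty} \int_0^{\infty} e^{-(x^2 + y^2)}\, dx\, dy.$$

This subterfuge may appear to actually complicate matters, but it allows
us to change to polar coordinates using $r = \sqrt{x^2 + y^2}$, $x = r\cos(\theta)$,
$y = r\sin(\theta)$, and $dx\, dy = r\, dr\, d\theta$ to obtain

$$\frac{I^2}{4} = \int_0^{\pi/2} \int_0^{\infty} r e^{-r^2}\, dr\, d\theta.$$

Again this might appear to have complicated matters by introducing the
extra factor of $r$, but now a change of variables of $u = r^2$ or $r = \sqrt{u}$
implies that $dr = du/(2\sqrt{u})$ so that

$$\frac{I^2}{4} = \frac{\pi}{2} \int_0^{\infty} \frac{1}{2} e^{-u}\, du = \frac{\pi}{4}$$

using (B.8). Thus

$$\int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}. \tag{B.11}$$

This is commonly expressed by changing variables to $r/\sqrt{2} = x$ so that
$dx = dr/\sqrt{2}$ and the result becomes

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-r^2/2}\, dr = 1, \tag{B.12}$$

from which it follows that a 0 mean unit variance Gaussian pdf has unit
integral. The general case is handled by a change of variables. In the
following integral change variables by defining $r = (x - m)/\sigma$ so that
$dx = \sigma\, dr$:

$$\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-m)^2}{2\sigma^2}}\, dx
 &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-r^2/2}\, \sigma\, dr && \text{(B.13)} \\
 &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-r^2/2}\, dr && \text{(B.14)} \\
 &= 1. && \text{(B.15)}
\end{aligned}$$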
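
The Gaussian results can also be checked numerically: the integral of $e^{-x^2}$ over the real line should be close to $\sqrt{\pi}$, and a Gaussian pdf with any mean $m$ and standard deviation $\sigma$ should integrate to 1. The sketch below again uses a simple Simpson rule over a wide but finite interval; the parameter values and the truncation limits are arbitrary assumptions.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def gaussian_pdf(x, m, sigma):
    """Gaussian pdf with mean m and variance sigma**2."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Basic integral: the tails beyond |x| = 10 are negligible.
basic = simpson(lambda x: math.exp(-x * x), -10.0, 10.0)

# General Gaussian pdf, integrated over 12 standard deviations either side of m.
m, sigma = 1.5, 2.0
total = simpson(lambda x: gaussian_pdf(x, m, sigma), m - 12 * sigma, m + 12 * sigma)

print(basic, math.sqrt(math.pi))
print(total)
```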



B.4 The Lebesgue Integral

This section provides a brief introduction to the Lebesgue integral, the
calculus that underlies rigorous probability theory. In the authors' view the
Lebesgue integral is not nearly as mysterious as is sometimes suggested in
the engineering literature; in fact, it has a very intuitive engineering
interpretation and avoids the rather clumsy limits required to study the
Riemann integral. We here present a few basic definitions and properties
without proof. Details can be found in most any book on measure theory
or integration and in many books on advanced probability, including the
first author's Probability, Random Processes, and Ergodic Properties [22].

Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space as defined in chapter 2.
For simplicity we focus on real-valued random variables; the extensions to
complex random variables and more general random vectors are
straightforward. The integral or expectation of a random variable $f$ defined
on this probability space is defined in a sequence of steps treating random
variables of increasing generality.

First suppose that $f$ takes on only a finite number of values, for example

$$f(x) = \sum_{i=1}^{N} a_i 1_{F_i}(x), \tag{B.16}$$

where it is assumed that $F_i \in \mathcal{F}$ for all $i$. A discrete random variable of
this form is sometimes called a simple function. The (Lebesgue) integral of
$f$ or expectation of $f$ is then defined by

$$\int f\, dP = \sum_{i=1}^{N} a_i P(F_i). \tag{B.17}$$

The integral is also written as $\int f(x)\, dP(x)$ and is also denoted by $E(f)$.
It is easy to see that this definition reduces to the Riemann integral.

The definition is next generalized to all nonnegative random variables
by means of a sequence of quantizers which map the random variable into
an ever better approximation with only a finite possible number of outputs.
Define for each real $r$ and each positive integer $n$ the quantizer

$$q_n(r) = \begin{cases}
n & r \ge n \\
(k-1)2^{-n} & (k-1)2^{-n} \le r < k2^{-n},\ k = 1, 2, \ldots, n2^n \\
-(k-1)2^{-n} & -(k-1)2^{-n} \ge r > -k2^{-n},\ k = 1, 2, \ldots, n2^n \\
-n & r \le -n
\end{cases} \tag{B.18}$$

The sequence of quantizers is asymptotically accurate in the sense that

$$f(x) = \lim_{n \to \infty} q_n(f(x)). \tag{B.19}$$

It can be shown without much effort that thanks to the specific
construction the sequence $q_n(x)$ is monotone increasing up to $x$ for
nonnegative $x$. Given a general nonnegative random variable $f$, the
integral is defined by

$$\int f\, dP = \lim_{n \to \infty} \int q_n(f)\, dP, \tag{B.20}$$

that is, as the limit of the simple integrals of the asymptotically accurate
sequence of quantized versions of the random variable. The monotonicity
of the quantizer sequence is enough to prove that the limit is well defined.
Thus the expectation or integral of any nonnegative random variable exists,
but it might be infinite.
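
The quantizer construction can be made concrete with a small example. The sketch below implements a quantizer of the type just described (truncation toward zero on a grid of spacing $2^{-n}$, clipped at $\pm n$) and evaluates the simple-function integral of $q_n(f)$ for the assumed example $f(\omega) = \omega^2$ with $P$ the uniform distribution on $[0, 1)$, for which the integral is 1/3. The example random variable and all function names are ours, not the text's.

```python
import math

def quantize(r, n):
    """Quantizer q_n: truncate toward zero on a grid of spacing 2**-n, clip at +-n."""
    if r >= n:
        return float(n)
    if r <= -n:
        return float(-n)
    step = 2.0 ** -n
    if r >= 0:
        return math.floor(r / step) * step
    return -math.floor(-r / step) * step

def simple_integral(n):
    """Integral of q_n(f) for f(w) = w**2 under the uniform measure on [0, 1).

    q_n(f) is a simple function: it equals (k-1)*2**-n on the event
    {(k-1)*2**-n <= w**2 < k*2**-n}, whose probability is the difference
    of the square roots of the two endpoints."""
    step = 2.0 ** -n
    total = 0.0
    for k in range(1, 2 ** n + 1):      # f < 1 <= n, so clipping never occurs
        lo, hi = (k - 1) * step, k * step
        prob = math.sqrt(hi) - math.sqrt(lo)
        total += lo * prob              # value times probability, as for a simple function
    return total

values = [simple_integral(n) for n in range(1, 12)]
print(values)
```

The printed values increase with $n$ and approach 1/3 from below, as the monotonicity of the quantizer sequence suggests.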

For an arbitrary random variable $f$, the integral is defined by breaking $f$
up into its positive and negative parts, defined by $f^+(x) = \max(f(x), 0) \ge 0$
and $f^-(x) = -\min(f(x), 0) \ge 0$ so that $f(x) = f^+(x) - f^-(x)$, and then
defining

$$\int f\, dP = \int f^+\, dP - \int f^-\, dP, \tag{B.21}$$

provided that this does not have the indeterminate form $\infty - \infty$, in which
case the integral does not exist.

This is one of several equivalent ways to define the Lebesgue integral. A
random variable $f$ is said to be integrable or $P$-integrable if $E(f) = \int f\, dP$
exists and is finite. It can be shown that if $f$ is integrable, then

$$\int f\, dP = \lim_{n \to \infty} \int q_n(f)\, dP, \tag{B.22}$$

that is, the form used to define the integral for nonnegative $f$ gives the
integral for integrable $f$.

A highly desirable property of integrals and one often taken for granted
in engineering applications is that limits and integrations can be
interchanged, e.g., if we are told we have a sequence of random variables
$f_n$; $n = 1, 2, 3, \ldots$ which converge to a random variable $f$ with
probability 1, that is, $F = \{\omega : \lim_{n \to \infty} f_n(\omega) = f(\omega)\}$ is an event
with $P(F) = 1$, then

$$\lim_{n \to \infty} \int f_n\, dP \stackrel{?}{=} \int f\, dP. \tag{B.23}$$

Unfortunately this is not true in general and the Riemann integral in
particular is poor when it comes to results along this line. There are two very
useful such convergence theorems, however, for the Lebesgue integral, which
we state next without proof. The first shows that this desirable property
holds when the random variables are monotone, the second when they are
dominated by an integrable random variable.







Theorem B.1 If $f_n$; $n = 1, 2, \ldots$ is a sequence of nonnegative random
variables that is monotone increasing up to $f$ (with probability 1) and
$f_n \ge 0$ (with probability 1) for all $n$, then

$$\lim_{n \to \infty} \int f_n\, dP = \int f\, dP. \tag{B.24}$$

Theorem B.2 If $f_n$; $n = 1, 2, \ldots$ is a sequence of random variables that
converges to $f$ (with probability 1) and if there is an integrable function $g$
which dominates the sequence in the sense that $|f_n| \le g$ (with
probability 1) for all $n$, then

$$\lim_{n \to \infty} \int f_n\, dP = \int f\, dP. \tag{B.25}$$
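
A standard pair of examples, not taken from the text, shows why some condition is needed before limits and integrals are exchanged. On the uniform probability space $[0, 1]$, the "spike" $f_n = n 1_{(0, 1/n)}$ converges to 0 at every point yet has integral 1 for every $n$, and no integrable $g$ dominates the whole sequence. By contrast $f_n(\omega) = \omega^n$ is dominated by the constant 1 and its integrals $1/(n+1)$ do converge to the integral of the limit, which is 0. The sketch below records both cases in exact arithmetic; all names are illustrative.

```python
from fractions import Fraction

def spike(n, w):
    """f_n(w) = n on the interval (0, 1/n) and 0 elsewhere."""
    return n if 0 < w < Fraction(1, n) else 0

def integral_spike(n):
    """Simple-function integral: value n times probability 1/n of the interval."""
    return Fraction(n) * Fraction(1, n)

def integral_power(n):
    """Integral of w**n over [0, 1] under the uniform measure, which is 1/(n+1)."""
    return Fraction(1, n + 1)

w = Fraction(1, 100)
# Pointwise, the spike is eventually 0 at any fixed w ...
print([spike(n, w) for n in (10, 50, 100, 1000)])
# ... but its integral never moves from 1, so the interchange fails.
print([integral_spike(n) for n in (10, 50, 100, 1000)])
# The dominated sequence behaves as Theorem B.2 predicts.
print([integral_power(n) for n in (10, 50, 100, 1000)])
```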




Appendix C 

Common Univariate 
Distributions 



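Binary pmf. $\Omega = \{0, 1\}$; $p(0) = 1 - p$, $p(1) = p$, where $p$ is a parameter
in $(0, 1)$.
mean: $p$
variance: $p(1 - p)$

Uniform pmf. $\Omega = \mathcal{Z}_n = \{0, 1, \ldots, n - 1\}$ and $p(k) = 1/n$; $k \in \mathcal{Z}_n$.
mean: $(n - 1)/2$
variance: $\frac{(2n-1)(n-1)}{6} - \left( \frac{n-1}{2} \right)^2 = \frac{n^2 - 1}{12}$

Binomial pmf. $\Omega = \mathcal{Z}_{n+1} = \{0, 1, \ldots, n\}$ and

$$p(k) = \binom{n}{k} p^k (1 - p)^{n-k}; \quad k \in \mathcal{Z}_{n+1},$$

where

$$\binom{n}{k} = \frac{n!}{k!(n - k)!}$$

is the binomial coefficient.
mean: $np$
variance: $np(1 - p)$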

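Geometric pmf. $\Omega = \{1, 2, 3, \ldots\}$ and $p(k) = (1 - p)^{k-1} p$; $k = 1, 2, \ldots$,
where $p \in (0, 1)$ is a parameter.
mean: $1/p$
variance: $(1 - p)/p^2$

Poisson pmf. $\Omega = \mathcal{Z}_+ = \{0, 1, 2, \ldots\}$ and $p(k) = (\lambda^k e^{-\lambda})/k!$, where $\lambda$
is a parameter in $(0, \infty)$. (Keep in mind that $0! = 1$.)
mean: $\lambda$
variance: $\lambda$

Uniform pdf. Given $b > a$, $f(r) = 1/(b - a)$ for $r \in [a, b]$.
mean: $(a + b)/2$
variance: $(b - a)^2/12$

Exponential pdf. $f(r) = \lambda e^{-\lambda r}$; $r \ge 0$.
mean: $1/\lambda$
variance: $1/\lambda^2$

Doubly exponential (or Laplacian) pdf. $f(r) = \frac{\lambda}{2} e^{-\lambda |r|}$; $r \in \Re$.
mean: 0
variance: $2\lambda^{-2}$

Gaussian (or Normal) pdf. $f(r) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(r - m)^2}{2\sigma^2} \right)$; $r \in \Re$.
Since the density is completely described by two parameters: the mean $m$
and variance $\sigma^2 > 0$, it is common to denote it by $\mathcal{N}(m, \sigma^2)$.
mean: $m$
variance: $\sigma^2$

Gamma pdf. $f(r) = \frac{r^{b-1} e^{-r/a}}{a^b \Gamma(b)}$; $r > 0$, where $a > 0$ and $b > 0$,
where

$$\Gamma(b) = \int_0^{\infty} e^{-r} r^{b-1}\, dr.$$

mean: $ab$
variance: $a^2 b$

Logistic pdf. $f(r) = \frac{e^{r/\lambda}}{\lambda (1 + e^{r/\lambda})^2}$; $r \in \Re$, where $\lambda > 0$.
mean: 0
variance: $\lambda^2 \pi^2/3$

Weibull pdf. $f(r) = \frac{b}{a^b} r^{b-1} e^{-(r/a)^b}$; $r > 0$, where $a > 0$ and $b > 0$. If
$b = 2$, this is called a Rayleigh distribution.
mean: $a\Gamma(1 + \frac{1}{b})$
variance: $a^2 \left( \Gamma(1 + \frac{2}{b}) - \Gamma^2(1 + \frac{1}{b}) \right)$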
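
Means and variances of the discrete distributions in this appendix can be checked by direct summation over the stated alphabets. The sketch below does this for the uniform pmf on $\{0, 1, \ldots, n-1\}$ and for the binomial, geometric, and Poisson pmfs; the parameter values and the truncation of the infinite alphabets are arbitrary assumptions made for illustration.

```python
import math

def pmf_moments(pmf, support):
    """Return (mean, variance) of a pmf by direct summation over `support`."""
    mean = sum(k * pmf(k) for k in support)
    second = sum(k * k * pmf(k) for k in support)
    return mean, second - mean ** 2

n, p, lam = 12, 0.3, 2.5     # illustrative parameter values

uniform = pmf_moments(lambda k: 1 / n, range(n))
binomial = pmf_moments(
    lambda k: math.comb(n, k) * p ** k * (1 - p) ** (n - k), range(n + 1))
# Infinite alphabets are truncated where the remaining mass is negligible.
geometric = pmf_moments(lambda k: (1 - p) ** (k - 1) * p, range(1, 500))
poisson = pmf_moments(
    lambda k: lam ** k * math.exp(-lam) / math.factorial(k), range(100))

print("uniform  ", uniform, ((n - 1) / 2, (n * n - 1) / 12))
print("binomial ", binomial, (n * p, n * p * (1 - p)))
print("geometric", geometric, (1 / p, (1 - p) / p ** 2))
print("poisson  ", poisson, (lam, lam))
```

Each line prints the summed moments next to the closed-form values, so any mismatch is visible at a glance.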




Appendix D 

Supplementary Reading 



In this appendix we provide some suggestions for supplementary reading. 
Our goal is to provide some leads for the reader interested in pursuing the 
topics treated in more depth. Admittedly we only scratch the surface of the 
large literature on probability and random processes. The books referred 
to are selected based on our own tastes — they are books from which we 
have learned and from which we have drawn useful results, techniques, and 
ideas for our own research. 

A good history of the theory of probability may be found in Maistrov [39],
who details the development of probability theory from its gambling origins
through its combinatorial and relative frequency theories to the
development by Kolmogorov of its rigorous axiomatic foundations. A somewhat
less serious historical development of elementary probability is given by Huff
and Geis [30]. Several early papers on the application of probability are
given in Newman [42]. Of particular interest are the papers by Bernoulli on
the law of large numbers and the paper by George Bernard Shaw comparing
the vice of gambling and the virtue of insurance.

An excellent general treatment of the theory of probability and random 
processes may be found in Ash [1], along with treatments of real analysis, 
functional analysis, and measure and integration theory. Ash is a former 
engineer turned mathematician, and his book is one of the best available 
for someone with an engineering background who wishes to pursue the 
mathematics beyond the level treated in this book. The only subject of 
this book completely absent in Ash is the second-order theory and linear
systems material of chapter 5 and the related examples of chapter 6.

Other good general texts on probability and random processes are those
of Breiman [6] and Chung [9]. These books are mathematical treatments
that are relatively accessible to engineers. All three books are a useful
addition to any library, and most of the mathematical details avoided here can
be found in these texts. Wong's book [58] provides a mathematical
treatment for engineers with a philosophy similar to ours but with an emphasis
on continuous time rather than discrete time random processes.

Another general text of interest is the inexpensive paperback book by 
Sveshnikov [53], which contains a wealth of problems in most of the topics 
covered here as well as many others. While the notation and viewpoint often 
differ, this book is a useful source of applications, formulas, and general 
tidbits. 

The set theory preliminaries of appendix A can be found in most any book
on probability, elementary or otherwise, or in most any book on elementary
real analysis. In addition to the general books mentioned, more detailed
treatments can be found in books on mathematical analysis such as those
by Rudin [50], Royden [48], and Simmons [51]. These references also
contain discussions of functions or mappings. A less mathematical text that
treats set theory and provides an excellent introduction to basic applied
probability is Drake [12].

The linear systems fundamentals are typical of most electrical
engineering linear systems courses. Good developments may be found in Chen [7],
Kailath [31], Bose and Stevens [4], and Papoulis [44], among others. A
treatment emphasizing discrete time may be found in Steiglitz [52]. A
minimal treatment of the linear systems aspects used in this book may also be
found in Gray and Goodman [23].

Detailed treatments of Fourier techniques may be found in Bracewell [5],
Papoulis [43], Gray and Goodman [23], and the early classic Wiener [55].
This background is useful both for the system theory applications and for
the manipulation of characteristic functions or moment-generating
functions of probability distributions.

Although the development of probability theory is self-contained, ele- 
mentary probability is best viewed as a prerequisite. An introductory text 
on the subject for review (or for the brave attempting the course with- 
out such experience) can be a useful source of intuition, applications, and 
practice of some of the basic ideas. Two books that admirably fill this func- 
tion are Drake [12] and the classic introductory text by two of the primary 
contributors to the early development of probability theory, Gnedenko and 
Khinchin [20]. The more complete text by Gnedenko [19] also provides a 
useful backup text. A virtual encyclopedia of basic probability, including 
a wealth of examples, distributions, and computations, may be found in 
Feller [15]. 

The axiomatic foundations of probability theory presented in chapter
2 were developed by Kolmogorov and first published in 1933. (See the
English translation [34].) Although not the only theory of probability (see,
e.g., Fine [16] for a survey of other approaches), it has become the standard
approach to the analysis of random systems. The general references cited
previously provide good additional material for the basic development of
probability spaces, measures, Lebesgue integration, and expectation. The
reader interested in probing more deeply into the mathematics is referred
to the classics by Halmos [27] and Loeve [37].

As observed in chapter 4, instead of beginning with axioms of probabil- 
ity and deriving the properties of expectation, one can go the other way and 
begin with axioms of expectation or integration and derive the properties of 
probability. Some texts treat measure and integration theory in this order, 
e.g., Asplund and Bungart [2]. A nice paperback book treating probabil- 
ity and random processes from this viewpoint in a manner accessible for 
engineers is that by Whittle [54]. 

A detailed and quite general development of the Kolmogorov extension 
theorem of chapter 3 may be found in Parthasarathy [45], who treats prob- 
ability theory for general metric spaces instead of just Euclidean spaces. 
The mathematical level of this book is high, though, and the going can be 
rough. It is useful, however, as a reference for very general results of this 
variety and for detailed statements of the theorem. A treatment may also 
be found in Gray [22]. 

Good background reading for chapters 4 and 6 are the book on conver- 
gence of random variables by Lukacs [38] and the book on ergodic theory 
by Billingsley [3]. The Billingsley book is a real gem for engineers inter- 
ested in learning more about the varieties and proofs of ergodic theorems 
for discrete time processes. The book also provides nice tutorial reviews 
on advanced conditional probability and a variety of other topics. Several 
proofs are given for the mean and pointwise ergodic theorems. Most are 
accessible given a knowledge of the material of this book plus a knowledge 
of the projection theorem of Hilbert space theory. The book also provides 
insight into applications of the general formulation of ergodic theory to 
areas other than random process theory. Another nice survey of ergodic
theory is that of Halmos [28].

As discussed in chapter 6, stationarity and ergodicity are sufficient but
not necessary conditions for the ergodic theorem to hold, that is, for sample
averages to converge. A natural question, then, is what conditions are both
necessary and sufficient. The answer is known for discrete time processes in
the following sense: A process is said to be asymptotically mean stationary
or a.m.s. if its process distribution, say $m$, is such that the limits

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} m(T^{-i} F)$$

exist for all process events $F$, where $T$ is the left-shift operation. The limits
trivially exist if the process is stationary. They also exist when the
nonstationarities die out with time and in a variety of other cases. It is known
that a process will have an ergodic theorem in the sense of having all sample
averages of bounded measurements converge if and only if the process is
a.m.s. [24, 22]. The sample averages of an a.m.s. process will converge to
constants with probability one if and only if the process is also ergodic.

Second-order theory of random processes and its application to filtering
and estimation form a bread-and-butter topic for engineering applications
and are the subject of numerous good books such as Grenander and
Rosenblatt [25], Cramér and Leadbetter [10], Rozanov [49], Yaglom [59], and
Liptser and Shiryayev [36]. It was pointed out that the theory of weakly
stationary processes is intimately related to the theory of Toeplitz forms
and Toeplitz matrices. An excellent treatment of the topic and its
applications to random processes is given by Grenander and Szegő [26]. A more
informal engineering-oriented treatment of Toeplitz matrices can be found
in Gray [21].

It is emphasized in our book that the focus is on discrete time random 
processes because of their simplicity. While many of the basic ideas gen- 
eralize, the details can become far more complicated, and much additional 
mathematical power becomes required. For example, the simple product 
sigma fields used here to generate process events are not sufficiently large 
to be useful. A simple integral of the process over a finite time window 
will not be measurable with respect to the resulting event spaces. Most 
of the added difficulties are technical — that is, the natural analogs to 
the discrete time results may hold, but the technical details of their proof 
can be far more complicated. Many excellent texts emphasizing continuous 
time random processes are available, but most require a solid foundation 
in functional analysis and in measure and integration theory. Perhaps the 
most famous and complete treatment is that of Doob [11]. Several of the 
references for second-order theory focus on continuous time random pro- 
cesses, as do Gikhman and Skorokhod [18], Hida [29], and McKean [40]. 
Lamperti [35] presents a clear summary of many facets of continuous time 
and discrete time random processes, including second-order theory, ergodic 
theorems, and prediction theory. 

In chapter 5 we briefly sketched some basic ideas of Wiener and Kalman 
filters as an application of second-order theory. A detailed general devel- 
opment of the fundamentals and recent results in this area may be found 
in Kailath [32] and the references listed therein. In particular, the classic 
development of Wiener [56] is an excellent treatment of the fundamentals 
of Wiener filtering. 

Of the menagerie of processes considered in the book, most may be
found in the various references already mentioned. The communication
modulation examples may also be found in Gagliardi [17], among others.
Compound Poisson processes are treated in detail in Parzen [46]. There
is an extensive literature on Markov processes and their applications; as
examples we cite Kemeny and Snell [33], Chung [8], Rosenblatt [47], and
Dynkin [14].

Perhaps the most notable beast absent from our menagerie of processes
is the class of martingales. Had the book and the target class length been
longer, martingales would have been the next topic to be added. They
were not included simply because we felt the current content already filled
a semester, and we did not want to expand the book past that goal. An
excellent mathematical treatment for the discrete time case may be found
in Neveu [41], and a readable description of the applications of martingale
theory to gambling may be found in the classic by Dubins and Savage [13].








Bibliography 



[1] R. B. Ash. Real Analysis and Probability. Academic Press, New York, 
1972. 

[2] E. Asplund and L. Bungart. A First Course in Integration. 
Holt, Rinehart and Winston, New York, 1966. 

[3] P. Billingsley. Ergodic Theory and Information. Wiley, New York, 
1965. 

[4] A. G. Bose and K. N. Stevens. Introductory Network Theory. Harper 
& Row, New York, 1965. 

[5] R. Bracewell. The Fourier Transform and Its Applications. McGraw- 
Hill, New York, 1965. 

[6] L. Breiman. Probability. Addison-Wesley, Menlo Park, CA, 1968.

[7] C. T. Chen. Introduction to Linear System Theory. Holt, Rinehart
and Winston, New York, 1970.

[8] K. L. Chung. Markov Chains with Stationary Transition Probabilities.
Springer-Verlag, New York, 1967.

[9] K. L. Chung. A Course in Probability Theory. Academic Press, New
York, 1974.

[10] H. Cramér and M. R. Leadbetter. Stationary and Related Stochastic
Processes. Wiley, New York, 1967.

[11] J. L. Doob. Stochastic Processes. Wiley, New York, 1953.

[12] A. W. Drake. Fundamentals of Applied Probability Theory.
McGraw-Hill, San Francisco, 1967.

[13] L. E. Dubins and L. J. Savage. Inequalities for Stochastic Processes:
How to Gamble If You Must. Dover, New York, 1976.






[14] E. B. Dynkin. Markov Processes. Springer-Verlag, New York, 1965.

[15] W. Feller. An Introduction to Probability Theory and its Applications, 
volume 2. Wiley, New York, 1960. 3rd ed. 

[16] T. Fine. Properties of an optimal digital system and applications. 
IEEE Trans. Inform. Theory, 10:287-296, Oct 1964. 

[17] R. Gagliardi. Introduction to Communications Engineering. Wiley, 
New York, 1978. 

[18] I. I. Gikhman and A. V. Skorokhod. Introduction to the Theory of 
Random Processes. Saunders, Philadelphia, 1965. 

[19] B. V. Gnedenko. The Theory of Probability. Chelsea, New York, 1963.
Translated from the Russian by B. D. Seckler.

[20] B. V. Gnedenko and A. Ya. Khinchine. An Elementary Introduction 
to the Theory of Probability. Dover, New York, 1962. Translated from 
the 5th Russian edition by L. F. Boron. 

[21] R. M. Gray. Toeplitz and circulant matrices: II. ISL technical
report no. 6504-1, Stanford University Information Systems Laboratory,
April 1977. (Available by anonymous ftp to isl.stanford.edu in the
directory pub/gray/reports/toeplitz or via the World Wide Web at
http://www-isl.stanford.edu/~gray/compression.html.)

[22] R. M. Gray. Probability, Random Processes, and Ergodic Properties.
Springer-Verlag, New York, 1988.

[23] R. M. Gray and J. W. Goodman. Fourier Transforms. Kluwer
Academic Publishers, Boston, Mass., 1995.

[24] R. M. Gray and J. C. Kieffer. Asymptotically mean stationary
measures. Ann. Probab., 8:962-973, 1980.

[25] U. Grenander and M. Rosenblatt. Statistical Analysis of Stationary 
Time Series. Wiley, New York, 1957. 

[26] U. Grenander and G. Szegő. Toeplitz Forms and Their Applications.
University of California Press, Berkeley and Los Angeles, 1958.

[27] P. R. Halmos. Measure Theory. Van Nostrand Reinhold, New York,
1950.

[28] P. R. Halmos. Lectures on Ergodic Theory. Chelsea, New York, 1956.







[29] T. Hida. Stationary Stochastic Processes. Princeton University Press, 
Princeton, NJ, 1970. 

[30] D. Huff and I. Geis. How to Take a Chance. W. W. Norton, New York, 
1959. 

[31] T. Kailath. Linear Systems. Prentice-Hall, Englewood Cliffs, NJ, 1980. 

[32] T. Kailath. Lectures on Wiener and Kalman Filtering. CISM Courses
and Lectures No. 140. Springer-Verlag, New York, 1981.

[33] J. G. Kemeny and J. L. Snell. Finite Markov Chains. D. Van Nostrand,
Princeton, NJ, 1960.

[34] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea,
New York, 1950.

[35] J. Lamperti. Stochastic Processes: A Survey of the Mathematical
Theory. Springer-Verlag, New York, 1977.

[36] R. S. Liptser and A. N. Shiryayev. Statistics of Random Processes.
Springer-Verlag, New York, 1977. Translated by A. B. Aries.

[37] M. Loeve. Probability Theory. D. Van Nostrand, Princeton, NJ, 1963. 
Third Edition. 

[38] E. Lukacs. Stochastic Convergence. Heath, Lexington, MA, 1968. 

[39] L. E. Maistrov. Probability Theory: A Historical Sketch. Academic 
Press, New York, 1974. Translated by S. Kotz. 

[40] H. P. McKean, Jr. Stochastic Integrals. Academic Press, New York, 
1969. 

[41] J. Neveu. Discrete-Parameter Martingales. North-Holland, New York, 
1975. Translated by T. P. Speed. 

[42] J. R. Newman. The World of Mathematics, volume 3. Simon & Schus- 
ter, New York, 1956. 

[43] A. Papoulis. The Fourier Integral and Its Applications. McGraw-Hill, 
New York, 1962. 

[44] A. Papoulis. Signal Analysis. McGraw-Hill, New York, 1977. 

[45] K. R. Parthasarathy. Probability Measures on Metric Spaces. Academic 
Press, New York, 1967. 







[46] E. Parzen. Stochastic Processes. Holden Day, San Francisco, 1962. 

[47] M. Rosenblatt. Markov Processes: Structure and Asymptotic Behavior.
Springer-Verlag, New York, 1971.

[48] H. L. Royden. Real Analysis. Macmillan, London, 1968. 

[49] Yu. A. Rozanov. Stationary Random Processes. Holden Day, San 
Francisco, 1967. Translated by A. Feinstein. 

[50] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New 
York, 1964. 

[51] G. F. Simmons. Introduction to Topology and Modern Analysis. 
McGraw-Hill, New York, 1963. 

[52] K. Steiglitz. An Introduction to Discrete Systems. Wiley, New York, 
1974. 

[53] A. A. Sveshnikov. Problems in Probability Theory, Mathematical 
Statistics, and Theory of Random Functions. Dover, New York, 1968. 

[54] P. Whittle. Probability. Penguin Books, Middlesex, England, 1970.

[55] N. Wiener. The Fourier Integral and Certain of Its Applications.
Cambridge University Press, New York, 1933.

[56] N. Wiener. Time Series: Extrapolation, Interpolation, and Smoothing of
Stationary Time Series with Engineering Applications. M. I. T. Press,
Cambridge, MA, 1966.

[57] N. Wiener and R. E. A. C. Paley. Fourier Transforms in the Complex
Domain. Am. Math. Soc. Coll. Pub., Providence, RI, 1934.

[58] E. Wong. Introduction to Random Processes. Springer-Verlag, New 
York, 1983. 

[59] A. M. Yaglom. An Introduction to the Theory of Stationary Random
Functions. Prentice-Hall, Englewood Cliffs, NJ, 1962. Translated by
R. A. Silverman.




Index



Φ function, 64
δ response, 102

a.m.s., 431 
abstract space, 389 
additive 

finite, 42 
additivity, 18, 42 
countable, 43 
finite, 18 
affine, 226 
algebra, 24 
alphabet, 104 

continuous, 116 
discrete, 116 
mixed, 116 
amplitude 

continuous, 116 
discrete, 116 
area, 11 

ARMA random process, 348 
asymptotically mean stationary, 431 
asymptotically uncorrelated, 256, 
258 

autocorrelation matrix, 227 
autoregressive, 160, 295 
autoregressive filter, 346 
autoregressive random process, 348 
average 

probabilistic, 47 
statistical, 47 
axioms, 18 

axioms of probability, 25 



Balakrishnan, A. V., 305
Bayes risk, 136 
Bayes’ rule, 131, 133, 136 
Bernoulli process, 94, 158 
binomial, 427 
binomial coefficient, 427 
binomial counting process, 162, 163, 349
bit, 184 

Bonferroni inequality, 78
Borel field, 37 
Borel sets, 37 
Borel space, 56 
Borel-Cantelli lemma, 247 

categorical, 390 

Cauchy-Schwarz inequality, 206 

causal, 410 

cdf, 81, 107, 119 

central limit theorem, 199, 235 

chain rule, 131, 162 

channel 

noisy, 135 

characteristic function, 148, 150, 
197 

Chernoff inequality, 249 
chi-squared, 113 
collectively exhaustive, 401 
complement, 394 
complementation, 394 
complete, 77
complete the square, 126, 140, 152
completion, 77 

conditional differential entropy, 270 
conditional expectation, 210 
conditional mean, 142 
conditional pmf, 130 
conditional probability, 71 
nonelementary, 168, 169 
conditional variance, 142 
consistency, 16, 121 
continuity, 43 
continuity from above, 45 
continuity from below, 45 
continuous time, 116 
convergence 

almost everywhere, 240 
almost surely, 240 
pointwise, 240 
w.p. 1, 240 

with probability one, 240 
convergence in distribution, 236 
convergence in mean square, 240 
convergence in probability, 240 
convolution, 406 
discrete, 138 
modulo 2, 138 
sum, 138 

coordinate function, 100 
correlation, 203 
correlation coefficient, 133 
countable, 400 
counting process, 163 
covariance, 205 
cross-correlation, 317 
cross-covariance, 214 
cross-spectral density, 317 
cumulative distribution function, 81, 107, 119

decision rule, 135 



decreasing sets, 33 
DeMorgan’s law, 399 
density 

mass, 11 
dependent, 92 
derived distribution, 21, 88 
detection, 135 
difference 

symmetric, 396 
differential entropy, 209 
Dirac delta, 66 
directly given, 22 
discrete spaces, 28 
discrete time, 116 
disjoint, 394, 401 
distance, 78 

distribution, 87, 105, 117 
convergence in, 236 
joint, 122 
marginal, 122 
domain, 401 
domain of definition, 21 
dominated convergence theorem, 
202 

dot product, 403 
doubly exponential, 428 
doubly stochastic, 375 

eigenvalue, 404 

eigenvector, 404 

element, 389 

elementary events, 23 

elementary outcomes, 23 

empty set, 393 


equivalent random variables, 89 

ergodic decomposition, 376 

ergodic theorem, 190 

ergodic theorems, 187 

ergodicity, 373 

error, mean squared, 216 

estimate
minimum mean squared error, 219
estimation, 146 

maximum a posteriori, 147 
event, 23 

event space, 12, 23, 31 
trivial, 27 
events 

elementary, 23 
expectation, 46, 56, 187, 189 
conditional, 210 
fundamental theorem of, 195 
iterated, 211 
nested, 211 
expected value, 190 
experiment, 12, 26 
exponential, 428 

field, 24 
filter 

autoregressive, 346 
linear, 406 
moving average, 344 
transversal, 345 
FIR, 344 

Fourier transform, 148, 151, 308 
function, 401 
identity, 47 
measurable, 98 

fundamental theorem of expecta- 
tion, 195 

Gamma, 428 
Gaussian, 158, 428 
jointly, 213 

Gaussian random vectors, 152 
geometric, 427 

Hamming weight, 55, 160 

hard limiter, 99 

hidden Markov model, 167 

identically distributed, 89 



identity function, 47 
identity mapping, 88 
iid, 94, 128, 158 
IIR, 344
image, 402 

impulse response, 406 
increasing sets, 32 
increments, 351 

independent, 351 
stationary, 351 
independence, 70, 127 
linear, 204 
independent, 127 
independent and stationary incre- 
ments, 350 

independent identically distributed, 
94, 128 

independent increments, 351 
independent random variables, 127 
indicator function, 46, 196 
induction, 50 
inequality 

Tchebychev, 242 
infinite 

countably, 400 
inner product, 403 
integral 

Lebesgue, 58, 423 
intersection, 394 
interval, 393 

closed, xv, 393
half-closed, 393 
half-open, 393 
open, xv, 393
inverse image, 87, 402 
inverse image formula, 88 
isi, 351 

iterated expectation, 211 

Jacobian, 114 
joint distribution, 122 
jointly Gaussian, 153 




Kronecker delta response, 102, 408

Laplace transform, 151 
Laplacian, 428 

law of large numbers, 187, 190 
Lebesgue integral, 423 
linear, 406 
linear models, 348 
logistic, 428 

MAP, 136 

MAP estimation, 147 
mapping, 401 
marginal pmf, 91 
Markov chain, 162 
Markov inequality, 242 
Markov process, 162, 165, 167 
mass, 11 
matrix, 403 

maximum a posteriori, 136 
maximum a posteriori estimation, 
147 

maximum likelihood estimation, 147 
mean, 47, 57, 190 
conditional, 142 
mean function, 158 
mean squared error, 216, 218 
mean vector, 206 
measurable, 98 
measurable space, 25 
measure 

probability, 42 
measure theory, 11 
memoryless, 159 
metric, 78 

minimum distance, 145 
mixture, 69, 375 
ML estimator, 147 
MMSE, 217 

modulo 2 arithmetic, 134 
moment, 47, 57 



centralized, 194 
second, 206 

moment generating function, 198 
moments, 194 

monotone convergence theorem, 202 
moving average, 275, 295, 344 
moving average filter, 344 
moving-average random process, 
348 

MSE, 216 

mutually exclusive, 394 
mutually independent, 94 

nested expectation, 211 
noise 

white, 302 
noisy channel, 135 
nonempty, 390 
nonnegative definite, 405 
numeric, 390 

one-sided, 29, 405 
one-step prediction, 221 
one-to-one, 402 
onto, 402 
operation, 146 
orthogonal, 230 

orthogonality principle, 226, 230, 
231 

outer product, 404 

Paley-Wiener criteria, 304
partition, 401 
pdf, 17, 61 

k-dimensional, 68
chi-squared, 113 
doubly exponential, 62, 428 
elementary conditional, 74 
exponential, 62, 428 
Gamma, 428 
Gaussian, 62, 428 
Laplacian, 62, 428 
logistic, 428
Rayleigh, 428
uniform, 17, 61, 428 
Weibull, 428 
Phi function, 64 
pmf, 20, 48 

binary, 48, 427
binomial, 48, 427
conditional, 73, 130
geometric, 427
Poisson, 428
product, 124
uniform, 48, 427
point, 389 

pointwise convergence, 240 
Poisson, 428 

Poisson counting process, 350, 362 
positive definite, 405 
power set, 35 

power spectral density, 289, 291 
prediction 

one-step, 221 
predictor 

one-step, 219 
optimal, 218 
preimage, 402 
probabilistic average, 190 
probability 

a posteriori, 71 
a priori, 71 
conditional, 71 
unconditional, 71 
probability density function, 17, 
61 

probability mass function, 20, 48 
probability measure, 12, 18, 25, 
42 

probability of error, 135 
probability space, 11, 12, 23, 26 
complete, 77 
trivial, 26 

probability theory, 11 
product pmf, 124 



product space, 392 
projection, 100 

quantizer, 21, 99, 412, 424 

random object, 85 
random process, 93, 115 
ARMA, 348 
autoregressive, 348 
Bernoulli, 158 
counting, 163 
Gaussian, 158 
iid, 158 
isi, 351 
Markov, 167 
moving-average, 348 
random variable, 21, 46, 56, 85, 
87 

continuous, 110 
discrete, 110 
Gaussian, 98 
mixture, 110 
random variable, 97 
random variables 
equivalent, 89 
independent, 127 
random vector, 89, 90, 115 
Gaussian, 152 
random walk, 275 
range, 21, 401 
range space, 402 
Rayleigh, 428 
rectangles, 41 
regression, 146 
regression coefficients, 346 
relative frequency, 189 

sample autocorrelation, 293 
sample points, 23 
sample space, 12, 23 
sampling function, 100 
sequence space, 29 
set, 392
empty, 393
one-point, 393 
singleton, 393 
universal, 389 
set difference, 396 
set theory, 389 
shift, 251 
sigma-algebra, 23 
sigma-field, 12, 23
generated, 37
signal, 14, 405 

continuous time, 30 
discrete time, 29 
signal processing, 14, 21, 46 
simple function, 424 
single-sided, 405 
space, 389 

empty, 390 
product, 392 
trivial, 390 
spectrum, 308 
stable, 283, 407, 409 
standard deviation, 206 
stationarity property, 251 
stationary, 253-255 
first order, 251 
strict sense, 253 
strictly, 253 
weakly, 252 

stationary increments, 351 
Stieltjes integral, 108
stochastic process, 116 
subset, 392 
superposition, 406 
symmetric, 403 
symmetric difference, 396 
system, 405 

continuous time, 405 
discrete time, 405 
linear, 406 



tapped delay line, 345 
Tchebychev inequality, 242 
telescoping sum, 44 
threshold detector, 145 
time 

continuous, 116 
discrete, 116 
time series, 116 
time-invariant, 406 
Toeplitz, 252 
Toeplitz matrix, 420 
transform 

Laplace, 151 
transversal filter, 345 
trivial probability space, 26 
two-sided, 29, 406 

uncorrelated, 204 

asymptotically, 256, 258 
uncountable, 400 
uniform, 427 
union, 394 
union bound, 78 
unit impulse, 66 
unit pulse response, 102 
universal set, 389 
uniform, 428

variance, 47, 57, 194, 206 
conditional, 142 
vector, 402 

random, 115 
volume, 11 

weakly stationary, 252 
Weibull, 428 
weight, 11 
white noise, 302 
Wiener process, 297, 349, 350 
discrete time, 166 




